Machine Translation Quality: BLEU Scores Explained

By Editorial Team

Last updated: March 2026

When someone claims that one translation engine is “better” than another, what are they actually measuring? In most cases, the answer is BLEU — a metric introduced in 2002 that remains the most widely used measure of machine translation quality in 2026, despite well-known limitations and newer alternatives.

This guide explains how BLEU works, what scores actually mean, why the metric persists despite its flaws, and how modern alternatives like COMET and xCOMET are changing translation evaluation. Whether you are a developer building translation features, a buyer comparing services, or a researcher benchmarking models, understanding BLEU is foundational.

What is BLEU?

BLEU stands for Bilingual Evaluation Understudy. It was introduced by Kishore Papineni and colleagues at IBM Research in a 2002 paper presented at the Association for Computational Linguistics (ACL) conference. The core idea was to create an automatic evaluation method that could replace or supplement expensive human evaluation of translation quality.

The original paper — BLEU: a Method for Automatic Evaluation of Machine Translation — has become one of the most cited papers in computational linguistics, with over 20,000 citations as of 2026.

BLEU works by comparing a machine translation (the “candidate”) against one or more human translations (the “references”) and measuring how many word sequences overlap between them. The underlying assumption is that a good machine translation will share many word sequences with a high-quality human translation.

How BLEU Scoring Works

Step 1: N-gram Matching

BLEU counts matching sequences of words (n-grams) between the candidate and reference translations:

  • Unigrams (single words): Capture whether the right vocabulary is present. Reflect translation adequacy — are the right concepts mentioned?
  • Bigrams (two-word sequences): Capture some word order and collocation.
  • Trigrams (three-word sequences): Capture more phrase-level structure.
  • 4-grams (four-word sequences): Capture fluency and grammatical coherence.

Standard BLEU (BLEU-4) uses all four levels. Each level is scored separately, then combined.
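To make the counting concrete, here is a minimal standard-library sketch of n-gram extraction (real toolkits such as SacreBLEU also handle tokenization and casing before this step):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

candidate = "the cat sat on the mat".split()
print(ngram_counts(candidate, 1))  # five distinct unigrams; "the" appears twice
print(ngram_counts(candidate, 2))  # five bigrams, e.g. ("the", "cat")
```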

Step 2: Modified Precision

For each n-gram level, BLEU calculates a modified precision score: the fraction of candidate n-grams that appear in the reference, with a clipping mechanism to prevent gaming. If the candidate repeats the same word 10 times but the reference only contains it twice, only two matches count.
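The clipping mechanism can be sketched in a few lines, using the pathological repeated-word candidate from the original BLEU paper:

```python
from collections import Counter

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: each candidate n-gram's matches are
    clipped to that n-gram's count in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(clipped_precision(candidate, reference, 1))  # 2/7, not 7/7
```

Without clipping, all seven occurrences of “the” would count as matches; with clipping, only the two occurrences present in the reference do.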

Step 3: Brevity Penalty

If the candidate translation is shorter than the reference, BLEU applies a brevity penalty. This prevents systems from achieving high precision by producing very short translations that only contain high-confidence words. The penalty is exponential: translations much shorter than the reference are penalized heavily.
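In the original formulation, the penalty is 1 for candidates at least as long as the reference and exp(1 − r/c) otherwise, where r and c are the reference and candidate lengths. A minimal sketch:

```python
import math

def brevity_penalty(cand_len, ref_len):
    """BLEU brevity penalty: no penalty for candidates at least as long
    as the reference; exponential penalty for shorter ones."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

print(brevity_penalty(18, 20))  # slightly short: mild penalty, ~0.895
print(brevity_penalty(10, 20))  # half the length: harsh penalty, ~0.368
```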

Step 4: Final Score

The final BLEU score is the geometric mean of the four n-gram precision scores, multiplied by the brevity penalty:

BLEU = BP × exp((log p1 + log p2 + log p3 + log p4) / 4)

where pn is the modified n-gram precision for n = 1 through 4 and BP is the brevity penalty.

The result is a number between 0 and 1, though it is conventionally reported on a 0-100 scale for readability.
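Putting the four steps together, a toy sentence-level BLEU-4 can be written in plain Python. This is a minimal sketch on pre-tokenized input: production toolkits such as SacreBLEU add standardized tokenization and smoothing, whereas this unsmoothed version returns 0 whenever any n-gram level has no matches.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU-4, reported on the 0-100 scale."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any empty level zeroes the geometric mean
        log_prec_sum += math.log(clipped / total)
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return 100 * bp * math.exp(log_prec_sum / max_n)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))                             # identical candidate: 100.0
print(bleu("the cat sat on a mat".split(), ref))  # one substitution: ~53.7
```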

Interpreting BLEU Scores

What the Numbers Mean

BLEU scores are not percentages of “accuracy.” A BLEU score of 50 does not mean a translation is 50% correct. The scale is relative, and interpretation depends heavily on language pair and content domain.

General interpretation guidelines:

| BLEU Range | Quality Level | Practical Meaning |
| --- | --- | --- |
| < 10 | Almost useless | Output is nearly incomprehensible |
| 10-19 | Low quality | Gets the gist across, many errors |
| 20-29 | Understandable | Meaning is clear, quality is rough |
| 30-39 | Good | Adequate for many practical purposes |
| 40-49 | High quality | Minor errors, generally fluent |
| 50-59 | Very high quality | Near-professional translation |
| 60+ | Excellent | Approaches human reference quality |

These ranges are approximate and vary by language pair. A BLEU score of 40 for English-German is a very different achievement than a BLEU score of 40 for English-Japanese, because the languages differ in structure, word order, and morphological complexity.

Current Benchmark Scores (2026)

Top-performing systems on WMT24/25 benchmarks:

| System | EN→DE | EN→FR | EN→ES |
| --- | --- | --- | --- |
| DeepL | 64.5 | 63.1 | 62.8 |
| GPT-4o | 62.1 | 60.8 | 61.4 |
| Google (Gemini) | 58.3 | 57.9 | 58.1 |
| NLLB-200 (3.3B) | 42.1 | 43.8 | 44.2 |

For an interactive comparison of current systems, see our Translation Accuracy Leaderboard and BLEU Score Calculator.

The Limitations of BLEU

Despite its dominance, BLEU has well-documented problems that the MT community has debated for two decades.

1. It Penalizes Valid Paraphrases

If the reference says “The cat sat on the mat” and the candidate says “The feline was sitting on the rug,” BLEU gives a low score because the n-grams don’t match — even though the translation is perfectly correct. BLEU measures string similarity, not semantic similarity.
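A quick check of the n-gram overlap between those two example sentences shows how little BLEU has to work with, even though the meaning is preserved:

```python
cand = "the feline was sitting on the rug".split()
ref = "the cat sat on the mat".split()

# Shared unigrams: only function words survive the paraphrase.
shared_unigrams = set(cand) & set(ref)
print(shared_unigrams)  # {'the', 'on'}

# Shared bigrams: a single match.
shared_bigrams = {tuple(cand[i:i + 2]) for i in range(len(cand) - 1)} \
               & {tuple(ref[i:i + 2]) for i in range(len(ref) - 1)}
print(shared_bigrams)  # {('on', 'the')}
```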

2. It Ignores Meaning

BLEU does not assess whether the translation conveys the correct meaning. A translation that uses similar words in a grammatically incorrect order could score higher than a correct translation that paraphrases. As the Machine Translate community wiki notes, BLEU “does not even try to measure translation quality” but rather focuses on string similarity to a single human reference.

3. Single-Reference Bias

Most BLEU evaluations use a single reference translation. But any sentence can be correctly translated in many different ways. Using multiple references helps but is expensive and still cannot cover the full space of valid translations.
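With multiple references, BLEU clips each candidate n-gram against its maximum count across the references, so a candidate is credited for matching whichever reference phrased things its way. A minimal sketch of that clipping step:

```python
from collections import Counter

def multi_ref_clipped_matches(candidate, references, n):
    """Clip each candidate n-gram to its max count in any single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)
    return sum(min(c, max_ref[g]) for g, c in cand.items())

cand = "the cat is on the mat".split()
refs = ["the cat sat on the mat".split(),
        "there is a cat on the mat".split()]
print(multi_ref_clipped_matches(cand, refs, 1))       # 6: every unigram matches some reference
print(multi_ref_clipped_matches(cand, refs[:1], 1))   # 5: "is" is unmatched with one reference
```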

4. Scores Are Not Comparable Across Language Pairs

An English-German BLEU score of 50 cannot be compared to a Japanese-English BLEU score of 50. Language pair difficulty, morphological complexity, and word order differences all affect the score range. Google’s own documentation explicitly warns against cross-language-pair comparisons.

5. It Doesn’t Capture Fluency Well

A translation can have high n-gram overlap with the reference but still read awkwardly. BLEU’s brevity penalty and n-gram matching provide only a rough proxy for fluency.

6. Gaming Potential

Systems can be optimized to maximize BLEU without improving actual translation quality. This “teaching to the test” problem has led some researchers to call BLEU scores an unreliable indicator of real-world performance.

Modern Alternatives to BLEU

The MT evaluation community has developed several metrics that address BLEU’s limitations. For a broader overview of all metrics, see our guide on Translation Quality Metrics.

COMET (Crosslingual Optimized Metric for Evaluation of Translation)

Developed by Unbabel, COMET is a neural metric that uses multilingual language models to evaluate translation quality. Unlike BLEU, it considers semantic meaning, not just string overlap.

Key advantages:

  • Correlates much better with human judgment than BLEU
  • Considers the source text, not just the reference
  • Captures meaning preservation even when wording differs

Performance: COMET-22 achieved 0.69 system-level correlation with human MQM ratings at WMT24, compared to roughly 0.55 for BLEU.

xCOMET

An extension of COMET that provides not only a score but also error span detection — it identifies which specific parts of a translation contain minor, major, or critical errors according to the MQM (Multidimensional Quality Metrics) typology.

Key advantage: Achieved approximately 0.72 system-level correlation at WMT24, the highest among automatic metrics. Combines sentence-level scoring with fine-grained error analysis.

MetricX

Google’s neural MT evaluation metric, offered through the Vertex AI platform. Designed for production use with strong correlation to human judgment. Can be deployed directly within Google Cloud translation pipelines.

METEOR

Addresses some BLEU limitations by incorporating synonym matching and stemming. More nuanced than BLEU but less widely adopted and still based on surface-level features.

Human Evaluation (MQM Framework)

The gold standard. Professional evaluators assess translations for adequacy (meaning preservation) and fluency (naturalness) using standardized error taxonomies. Expensive and slow but the most reliable measure of translation quality.

When to Use Which Metric

| Scenario | Recommended Metric | Why |
| --- | --- | --- |
| Published research papers | BLEU + COMET | BLEU for comparability with prior work, COMET for better signal |
| Production quality monitoring | COMET or xCOMET | Better correlation with actual quality |
| Quick development iteration | BLEU | Fast, easy to compute, universally understood |
| High-stakes quality assessment | Human MQM evaluation | Nothing substitutes for human judgment on critical content |
| Error analysis and debugging | xCOMET | Pinpoints specific error spans |
| Vendor comparison | BLEU + COMET + human sample | Multiple metrics reduce the risk of any one metric's blind spots |

How BLEU is Used in Practice

Model Development

Researchers report BLEU scores to demonstrate improvements. A new architecture or training technique is typically validated by showing BLEU improvement on standard test sets (WMT, FLORES, Tatoeba). Understanding BLEU is essential for following published research on how AI translation works.

Service Comparison

Buyers compare translation services partly on benchmark scores. Our comparison of Google Translate vs DeepL vs ChatGPT uses BLEU alongside human evaluation and other metrics.

Quality Assurance

Production systems use BLEU (or COMET) to detect regressions. If a model update causes BLEU scores to drop on a held-out test set, the update is rolled back. This is particularly important in enterprise translation environments.

Benchmark Tracking

Organizations like WMT (Workshop on Machine Translation) run annual evaluation campaigns that track translation quality improvements across the field. These campaigns use BLEU alongside human evaluation and newer metrics.

The Future of Translation Evaluation

BLEU is not going away anytime soon. Its simplicity, speed, and universal recognition make it indispensable for quick comparisons. However, the field is clearly moving toward neural metrics:

  • Multi-metric reporting is becoming standard. Top venues now expect BLEU, COMET, and human evaluation together.
  • Reference-free metrics are emerging, which evaluate translations without needing a human reference at all — using the source text and the candidate only.
  • LLM-as-judge approaches, where large language models evaluate translation quality, show promise but raise questions about circular evaluation.

The organizations that invest in robust evaluation across multiple metrics will make better decisions about which translation tools and approaches serve their needs. Start with BLEU for baseline comparisons, use COMET/xCOMET for more reliable quality signals, and validate critical decisions with human evaluation.

FAQ

What is a good BLEU score? It depends on the language pair and domain. For high-resource European pairs, modern systems score 55-65. For distant language pairs (English-Japanese), scores of 30-40 are considered good. Above 60 for any language pair indicates excellent quality approaching human reference level.

Why is BLEU still used if it has so many problems? Three reasons: it is fast to compute, universally understood, and provides a consistent baseline for comparisons. Every MT paper from the last two decades reports BLEU, so it enables longitudinal comparison. Newer metrics are better but lack this universal adoption.

Can I calculate BLEU scores myself? Yes. Libraries like SacreBLEU (Python) provide standardized BLEU calculation. You need a machine translation output and one or more reference translations. Try our BLEU Score Calculator for a quick check.

How is COMET different from BLEU? COMET uses neural language models to assess semantic meaning, while BLEU only counts matching word sequences. COMET considers the source text, handles paraphrases correctly, and correlates much better with human quality judgments. See our full Translation Quality Metrics guide for details.

Do BLEU scores predict user satisfaction? Loosely. Higher BLEU scores generally correspond to better translations, but the correlation is imperfect. A translation with a BLEU of 40 might be perfectly adequate for one user’s needs while a score of 55 might still have errors that matter in a legal context. Use BLEU for relative comparison, not absolute quality prediction.

What is MQM and how does it relate to BLEU? MQM (Multidimensional Quality Metrics) is a human evaluation framework that classifies translation errors by type and severity. It is the most reliable quality measure but requires trained evaluators. BLEU attempts to approximate human judgment automatically. xCOMET bridges the gap by detecting MQM-style error categories automatically.
