Translation Quality Metrics: BLEU, COMET, and Human Evaluation Explained
When we say one translation system is “better” than another, what do we actually mean? How is translation quality measured, and how reliable are those measurements?
This guide explains the major metrics used to evaluate machine translation, their strengths and weaknesses, and how to interpret scores when comparing systems.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
Why Measurement Matters
Translation quality measurement serves several purposes:
- Model development: Researchers need metrics to know if their changes improve the model.
- Provider comparison: Buyers need to compare translation services objectively.
- Quality assurance: Production systems need automated checks to catch regressions.
- Benchmark tracking: The field needs standard measurements to track progress over time.
Without reliable metrics, we are left with subjective impressions — which, while valuable, are difficult to scale and compare.
Automated Metrics
BLEU (Bilingual Evaluation Understudy)
BLEU is the most widely cited translation metric, introduced in 2002 by IBM researchers. It remains the standard reference point, despite well-known limitations.
How it works: BLEU compares the machine translation output against one or more human reference translations. It counts how many n-grams (sequences of 1, 2, 3, and 4 consecutive words) in the machine output also appear in the reference translations.
Score range: 0 to 100 (or 0 to 1, depending on the tool). Higher is better.
Interpreting BLEU scores:
| BLEU Score | General Quality Level |
|---|---|
| 50+ | Very high quality, often indistinguishable from human |
| 40-50 | High quality, understandable and largely accurate |
| 30-40 | Good quality for gisting, needs editing for professional use |
| 20-30 | Functional but with noticeable errors |
| 10-20 | Low quality, may miss or distort meaning |
| Below 10 | Very low quality, often unusable |
Important caveats:
- These thresholds vary significantly by language pair. A BLEU of 30 for English-German means something very different from a BLEU of 30 for English-Japanese.
- BLEU penalizes valid alternative translations. If the reference says “the car is red” and the system produces “the automobile is red,” BLEU penalizes the mismatch even though the meaning is identical.
- BLEU does not account for fluency or meaning preservation directly — it is purely an n-gram overlap measure.
- Different BLEU implementations (SacreBLEU, NLTK BLEU, Moses BLEU) can produce different scores for the same output, making cross-paper comparisons unreliable unless the implementation is specified.
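The n-gram mechanics described above can be sketched in a few lines of Python. This is a toy, sentence-level illustration for intuition only, not a replacement for SacreBLEU: real BLEU is computed at the corpus level with careful tokenization and smoothing, and a real implementation clips counts against multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_counts & r_counts).values())  # clipped counts
        total = max(sum(c_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * geo_mean
```

Running `toy_bleu("the automobile is red", "the car is red")` illustrates the synonym problem from the caveats above: a single valid word substitution destroys the 3-gram and 4-gram matches, collapsing the score to near zero even though the meaning is identical.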
COMET (Crosslingual Optimized Metric for Evaluation of Translation)
COMET is a neural metric that has largely superseded BLEU as the preferred automated metric in recent research. It uses a trained model to predict human quality judgments.
How it works: COMET takes the source sentence, the machine translation, and (optionally) a reference translation. It encodes all three using a pre-trained multilingual language model and produces a score that correlates with human judgments of quality.
Score range: Typically 0 to 1, with higher being better. Scores above 0.85 generally indicate high-quality translations.
Advantages over BLEU:
- Much higher correlation with human judgments (the gold standard)
- Can evaluate translation quality even without a reference translation (QE variant)
- Captures meaning preservation and fluency, not just surface overlap
- Handles synonyms, paraphrases, and reordering better
Limitations:
- Requires running a neural model (computationally heavier than BLEU)
- The model itself may have biases from its training data
- Scores are harder to interpret intuitively than BLEU
- Different COMET model versions produce different scores
chrF (Character F-Score)
chrF measures character n-gram overlap between the machine translation and reference. It is particularly useful for morphologically rich languages where word-level metrics like BLEU struggle.
How it works: Instead of comparing words, chrF compares character sequences. This means it can give partial credit for morphological variants (“running” vs “runs”) and handles agglutinative languages better.
Advantages:
- No tokenization needed (language-agnostic)
- Better for morphologically rich languages (Turkish, Finnish, Hungarian)
- More robust than BLEU for short sentences
- Simple to compute
Limitations:
- Less interpretable than BLEU
- Does not capture word-level semantics well
- Can give high scores for translations that share many characters but differ in meaning
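A minimal chrF-style score can be sketched from the description above: average the character n-gram F-scores for n = 1 to 6, weighting recall more heavily than precision (chrF uses beta = 2). This sketch strips whitespace before counting, as common implementations do by default; treat the details as illustrative rather than a reference implementation.

```python
from collections import Counter

def toy_chrf(candidate, reference, max_n=6, beta=2.0):
    """Toy chrF: mean character n-gram F-score over n = 1..max_n,
    with recall weighted beta^2 times precision."""
    cand = candidate.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        c = Counter(cand[i:i + n] for i in range(len(cand) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((c & r).values())
        prec = overlap / max(sum(c.values()), 1)
        rec = overlap / max(sum(r.values()), 1)
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / max_n
```

Comparing `toy_chrf("running", "runs")` against a word-level metric shows the point of chrF: the shared stem “run” earns partial credit at the character level, where exact word matching would score zero.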
TER (Translation Edit Rate)
TER measures the minimum number of edits (insertions, deletions, substitutions, shifts) needed to transform the machine translation into the reference. Lower is better.
How it works: TER essentially measures how much a human post-editor would need to change the machine output to make it match the reference. This makes it directly relevant to MTPE (Machine Translation Post-Editing) workflows.
Score range: 0 upward, expressed as a ratio (it can exceed 1 when the output needs more edits than the reference has words). A TER of 0 means the output matches the reference exactly. A TER of 0.5 means the number of edits equals half the length of the reference, roughly 50% of the words.
Advantages:
- Directly relevant to post-editing cost estimation
- Intuitive interpretation (percentage of words that need changing)
- Accounts for word reordering (shifts)
Limitations:
- Like BLEU, penalizes valid alternative translations
- Does not capture fluency well
- Sensitive to reference translation choice
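A rough TER can be approximated with word-level edit distance divided by reference length. Note one deliberate simplification: real TER also allows block “shifts” (moving a phrase) at a cost of one edit, which the plain Levenshtein sketch below omits, so it reports an upper bound on true TER.

```python
def ter_approx(candidate, reference):
    """Approximate TER: word-level Levenshtein distance (insertions,
    deletions, substitutions) divided by reference length.
    Real TER also supports phrase shifts, omitted here."""
    c, r = candidate.split(), reference.split()
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(r) + 1))
    for i, cw in enumerate(c, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if cw == rw else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / len(r)
```

Here `ter_approx("the automobile is red", "the car is red")` gives 0.25: one substitution out of four reference words, which maps directly onto the post-editing interpretation above.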
BERTScore
BERTScore uses contextual embeddings from BERT (or similar models) to compare the semantic similarity between machine translation and reference at the token level.
How it works: Each token in both the candidate and reference is represented by its contextual embedding. BERTScore computes precision, recall, and F1 based on the similarity of these embeddings.
Advantages:
- Captures semantic similarity rather than surface overlap
- Handles paraphrasing and synonyms well
- Language-agnostic (with multilingual BERT)
Limitations:
- Computationally expensive
- Dependent on the underlying embedding model
- Scores can be difficult to calibrate across languages
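The greedy-matching arithmetic at the heart of BERTScore can be shown with a toy embedding table. In the real metric the vectors come from a contextual model and are IDF-weighted; the hand-made vectors below are purely hypothetical, chosen so that “car” and “automobile” are near-neighbors.

```python
import math

# Hypothetical 3-dimensional "embeddings" standing in for contextual
# BERT vectors; chosen only to make the matching arithmetic visible.
EMB = {
    "the":        [1.0, 0.0, 0.0],
    "car":        [0.0, 1.0, 0.2],
    "automobile": [0.0, 0.9, 0.3],
    "is":         [0.5, 0.5, 0.0],
    "red":        [0.0, 0.0, 1.0],
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def toy_bertscore_f1(candidate, reference):
    """Core of BERTScore: greedily match each token to its most
    similar counterpart, then combine precision and recall into F1."""
    cand = [EMB[t] for t in candidate.split()]
    ref = [EMB[t] for t in reference.split()]
    # Precision: each candidate token vs. its closest reference token.
    p = sum(max(cos(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token vs. its closest candidate token.
    r = sum(max(cos(rv, c) for c in cand) for rv in ref) / len(ref)
    return 2 * p * r / (p + r)
```

Unlike the BLEU example earlier, `toy_bertscore_f1("the automobile is red", "the car is red")` stays close to 1.0, because “automobile” and “car” sit near each other in embedding space.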
Human Evaluation Methods
Automated metrics are proxies for human judgment. When accuracy matters, human evaluation is essential.
Direct Assessment (DA)
How it works: Evaluators rate translations on a continuous scale (typically 0-100) for adequacy (does it convey the meaning?) and fluency (does it read naturally?).
Advantages: Simple, scalable, captures overall quality.
Limitations: Subjective, requires clear guidelines, and inter-annotator agreement varies.
MQM (Multidimensional Quality Metrics)
How it works: Evaluators identify and classify specific errors in translations. Each error is categorized (accuracy, fluency, terminology, style, etc.) and assigned a severity (critical, major, minor).
Advantages: Diagnostic: it tells you not just that a translation is bad but why. A standardized error taxonomy allows comparison across evaluations.
Limitations: Expensive, time-consuming, and requires trained evaluators.
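Turning annotated errors into a single number usually means weighted penalty points. The weights and normalization below (1/5/10 points per minor/major/critical error, scaled per 100 words) are common choices in MQM-style deployments but are assumptions here, not a fixed standard; real projects tune both.

```python
# Assumed severity weights; actual MQM deployments configure these.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors, word_count, max_score=100):
    """Score a translation from annotated errors: start at max_score
    and subtract weighted penalty points per 100 evaluated words.
    `errors` is a list of (category, severity) tuples."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    return max_score - 100 * penalty / word_count
```

For a 200-word sample with one minor terminology error and one major accuracy error, this yields 100 − 100 × 6 / 200 = 97 points, and the per-category tuples preserve the diagnostic detail that makes MQM valuable.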
Ranking-Based Evaluation
How it works: Evaluators compare two or more translations of the same source sentence and rank them from best to worst.
Advantages: Easier than absolute scoring; humans are better at relative judgments than absolute ones.
Limitations: Does not tell you how good the best option is in absolute terms, and does not scale well to many systems.
Post-Editing Effort
How it works: Measure the time, keystrokes, or edit distance required for a human to post-edit machine translation output to acceptable quality.
Advantages: Directly relevant to production cost and efficiency.
Limitations: Depends heavily on the post-editor’s skill and familiarity with the domain.
How We Evaluate at nllb.com
Our translation comparisons use a multi-metric approach:
- BLEU scores (SacreBLEU implementation) for comparability with published benchmarks.
- COMET scores (latest COMET model) as our primary automated quality indicator.
- Editorial evaluation by native speakers rating on a 1-10 scale for adequacy and fluency.
- Error annotation (MQM-inspired) for identifying systematic issues.
We report all metrics rather than relying on any single number, because no single metric captures the full picture.
Common Pitfalls in Interpreting Metrics
1. Comparing BLEU Scores Across Language Pairs
A BLEU of 35 for English-Spanish is not the same quality level as a BLEU of 35 for English-Japanese. Language characteristics dramatically affect raw scores.
2. Confusing Correlation with Reliability
COMET correlates well with human judgments on average, but can be unreliable for individual sentences or unusual content types.
3. Ignoring Confidence Intervals
A BLEU difference of 0.5 points between two systems is usually not statistically significant. Always check whether differences are meaningful.
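One standard way to check whether a score difference is meaningful is paired bootstrap resampling over segment-level scores: resample the test set with replacement many times and count how often system A outscores system B. This sketch assumes you already have per-segment scores for both systems on the same test set.

```python
import random

def bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap: resample segments with replacement and
    return the fraction of resamples where system A's total score
    beats system B's. A win rate near 0.5 means the observed
    difference is indistinguishable from noise; a rate near 1.0
    (or 0.0) suggests a real difference."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

In practice you would feed this segment-level COMET or chrF scores; the same resampling idea underlies the significance tests reported alongside published MT benchmarks.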
4. Cherry-Picking Examples
Any system can produce impressive individual translations. Quality should be judged on large, representative test sets.
5. Using Outdated Benchmarks
Translation quality improves over time. Results from 2023 may not reflect 2026 performance. Check when benchmarks were last updated.
6. Assuming Metrics Capture Everything
No metric captures cultural appropriateness, register accuracy, or domain-specific correctness well. These require human judgment.
Practical Guidelines
For Developers
- Use SacreBLEU for reproducible BLEU scores
- Add COMET as a primary quality metric
- Implement automated quality checks in your CI/CD pipeline
- Sample and manually review translations regularly
For Buyers
- Ask providers for metric scores on your specific language pairs and content types
- Request blind human evaluation as part of any pilot
- Do not rely solely on provider-supplied benchmarks
For Researchers
- Report multiple metrics (BLEU, COMET, chrF at minimum)
- Use SacreBLEU with documented parameters for reproducibility
- Include human evaluation for any major claims
- Report confidence intervals or significance tests
Key Takeaways
- BLEU is the most widely cited metric but has significant limitations — it penalizes valid alternatives and does not capture meaning preservation well.
- COMET is the current best automated metric, offering much higher correlation with human judgments, but it requires running a neural model.
- No single automated metric is sufficient. The best practice is to use multiple metrics and supplement with human evaluation.
- Human evaluation remains the gold standard but is expensive and subjective. MQM provides the most diagnostic feedback.
- When comparing systems, ensure metrics are computed consistently (same implementation, same test set, same references) and check for statistical significance.
Next Steps
- Calculate BLEU scores: Use our BLEU Score Calculator: Test Your Translation Quality to test your translations.
- See metrics in action: Check the Translation Accuracy Leaderboard by Language Pair for comparative scores across systems.
- Compare systems yourself: Use the Translation AI Playground: Compare Models Side-by-Side to run side-by-side comparisons.
- Learn how AI translation works: Read How AI Translation Works: Neural Machine Translation Explained for the technical foundation.
- Apply metrics to your evaluation: Use our framework in Enterprise Translation: How to Evaluate AI Translation Providers for systematic provider assessment.