Translation Quality Metrics: BLEU, COMET, and Human Evaluation Explained
When we say one translation system is “better” than another, what do we actually mean? How is translation quality measured, and how reliable are those measurements?
This guide explains the major metrics used to evaluate machine translation, their strengths and weaknesses, and how to interpret scores when comparing systems.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
Why Measurement Matters
Translation quality measurement serves several purposes:
- Model development: Researchers need metrics to know if their changes improve the model.
- Provider comparison: Buyers need to compare translation services objectively.
- Quality assurance: Production systems need automated checks to catch regressions.
- Benchmark tracking: The field needs standard measurements to track progress over time.
Without reliable metrics, we are left with subjective impressions — which, while valuable, are difficult to scale and compare.
Automated Metrics
BLEU (Bilingual Evaluation Understudy)
BLEU is the most widely cited translation metric, introduced in 2002 by IBM researchers. It remains the standard reference point, despite well-known limitations.
How it works: BLEU compares the machine translation output against one or more human reference translations. It counts how many n-grams (sequences of 1, 2, 3, and 4 consecutive words) in the machine output also appear in the reference translations.
Score range: 0 to 100 (or 0 to 1, depending on the tool). Higher is better.
Interpreting BLEU scores:
| BLEU Score | General Quality Level |
|---|---|
| 50+ | Very high quality, often indistinguishable from human |
| 40-50 | High quality, understandable and largely accurate |
| 30-40 | Good quality for gisting, needs editing for professional use |
| 20-30 | Functional but with noticeable errors |
| 10-20 | Low quality, may miss or distort meaning |
| Below 10 | Very low quality, often unusable |
Important caveats:
- These thresholds vary significantly by language pair. A BLEU of 30 for English-German means something very different from a BLEU of 30 for English-Japanese.
- BLEU penalizes valid alternative translations. If the reference says “the car is red” and the system produces “the automobile is red,” BLEU penalizes the mismatch even though the meaning is identical.
- BLEU does not account for fluency or meaning preservation directly — it is purely an n-gram overlap measure.
- Different BLEU implementations (SacreBLEU, NLTK BLEU, Moses BLEU) can produce different scores for the same output, making cross-paper comparisons unreliable unless the implementation is specified.
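The n-gram mechanics described above can be sketched in a few lines of Python. This is a toy, sentence-level illustration for intuition only, not a replacement for SacreBLEU: real BLEU is computed at the corpus level with careful tokenization and smoothing, and a real implementation clips counts against multiple references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_counts & r_counts).values())  # clipped counts
        total = max(sum(c_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * geo_mean
```

Running `toy_bleu("the automobile is red", "the car is red")` illustrates the synonym problem from the caveats above: a single valid word substitution destroys the 3-gram and 4-gram matches, collapsing the score to near zero even though the meaning is identical.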
COMET (Crosslingual Optimized Metric for Evaluation of Translation)
COMET is a neural metric that has largely superseded BLEU as the preferred automated metric in recent research. It uses a trained model to predict human quality judgments.
How it works: COMET takes the source sentence, the machine translation, and (optionally) a reference translation. It encodes all three using a pre-trained multilingual language model and produces a score that correlates with human judgments of quality.
Score range: Typically 0 to 1, with higher being better. Scores above 0.85 generally indicate high-quality translations.
Advantages over BLEU:
- Much higher correlation with human judgments (the gold standard)
- Can evaluate translation quality even without a reference translation (QE variant)
- Captures meaning preservation and fluency, not just surface overlap
- Handles synonyms, paraphrases, and reordering better
Limitations:
- Requires running a neural model (computationally heavier than BLEU)
- The model itself may have biases from its training data
- Scores are harder to interpret intuitively than BLEU
- Different COMET model versions produce different scores
chrF (Character F-Score)
chrF measures character n-gram overlap between the machine translation and reference. It is particularly useful for morphologically rich languages where word-level metrics like BLEU struggle.
How it works: Instead of comparing words, chrF compares character sequences. This means it can give partial credit for morphological variants (“running” vs “runs”) and handles agglutinative languages better.
Advantages:
- No tokenization needed (language-agnostic)
- Better for morphologically rich languages (Turkish, Finnish, Hungarian)
- More robust than BLEU for short sentences
- Simple to compute
Limitations:
- Less interpretable than BLEU
- Does not capture word-level semantics well
- Can give high scores for translations that share many characters but differ in meaning
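A minimal chrF-style score can be sketched from the description above: average the character n-gram F-scores for n = 1 to 6, weighting recall more heavily than precision (chrF uses beta = 2). This sketch strips whitespace before counting, as common implementations do by default; treat the details as illustrative rather than a reference implementation.

```python
from collections import Counter

def toy_chrf(candidate, reference, max_n=6, beta=2.0):
    """Toy chrF: mean character n-gram F-score over n = 1..max_n,
    with recall weighted beta^2 times precision."""
    cand = candidate.replace(" ", "")
    ref = reference.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        c = Counter(cand[i:i + n] for i in range(len(cand) - n + 1))
        r = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        overlap = sum((c & r).values())
        prec = overlap / max(sum(c.values()), 1)
        rec = overlap / max(sum(r.values()), 1)
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / max_n
```

Comparing `toy_chrf("running", "runs")` against a word-level metric shows the point of chrF: the shared stem “run” earns partial credit at the character level, where exact word matching would score zero.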
TER (Translation Edit Rate)
TER measures the minimum number of edits (insertions, deletions, substitutions, shifts) needed to transform the machine translation into the reference. Lower is better.
How it works: TER essentially measures how much a human post-editor would need to change the machine output to make it match the reference. This makes it directly relevant to MTPE (Machine Translation Post-Editing) workflows.
Score range: 0 upward, expressed as a ratio (it can exceed 1 when the output needs more edits than the reference has words). A TER of 0 means the output matches the reference exactly. A TER of 0.5 means the number of edits equals half the length of the reference, roughly 50% of the words.
Advantages:
- Directly relevant to post-editing cost estimation
- Intuitive interpretation (percentage of words that need changing)
- Accounts for word reordering (shifts)
Limitations:
- Like BLEU, penalizes valid alternative translations
- Does not capture fluency well
- Sensitive to reference translation choice
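A rough TER can be approximated with word-level edit distance divided by reference length. Note one deliberate simplification: real TER also allows block “shifts” (moving a phrase) at a cost of one edit, which the plain Levenshtein sketch below omits, so it reports an upper bound on true TER.

```python
def ter_approx(candidate, reference):
    """Approximate TER: word-level Levenshtein distance (insertions,
    deletions, substitutions) divided by reference length.
    Real TER also supports phrase shifts, omitted here."""
    c, r = candidate.split(), reference.split()
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(r) + 1))
    for i, cw in enumerate(c, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if cw == rw else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # substitution / match
        prev = cur
    return prev[-1] / len(r)
```

Here `ter_approx("the automobile is red", "the car is red")` gives 0.25: one substitution out of four reference words, which maps directly onto the post-editing interpretation above.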
BERTScore
BERTScore uses contextual embeddings from BERT (or similar models) to compare the semantic similarity between machine translation and reference at the token level.
How it works: Each token in both the candidate and reference is represented by its contextual embedding. BERTScore computes precision, recall, and F1 based on the similarity of these embeddings.
Advantages:
- Captures semantic similarity rather than surface overlap
- Handles paraphrasing and synonyms well
- Language-agnostic (with multilingual BERT)
Limitations:
- Computationally expensive
- Dependent on the underlying embedding model
- Scores can be difficult to calibrate across languages
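The greedy-matching arithmetic at the heart of BERTScore can be shown with a toy embedding table. In the real metric the vectors come from a contextual model and are IDF-weighted; the hand-made vectors below are purely hypothetical, chosen so that “car” and “automobile” are near-neighbors.

```python
import math

# Hypothetical 3-dimensional "embeddings" standing in for contextual
# BERT vectors; chosen only to make the matching arithmetic visible.
EMB = {
    "the":        [1.0, 0.0, 0.0],
    "car":        [0.0, 1.0, 0.2],
    "automobile": [0.0, 0.9, 0.3],
    "is":         [0.5, 0.5, 0.0],
    "red":        [0.0, 0.0, 1.0],
}

def cos(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def toy_bertscore_f1(candidate, reference):
    """Core of BERTScore: greedily match each token to its most
    similar counterpart, then combine precision and recall into F1."""
    cand = [EMB[t] for t in candidate.split()]
    ref = [EMB[t] for t in reference.split()]
    # Precision: each candidate token vs. its closest reference token.
    p = sum(max(cos(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token vs. its closest candidate token.
    r = sum(max(cos(rv, c) for c in cand) for rv in ref) / len(ref)
    return 2 * p * r / (p + r)
```

Unlike the BLEU example earlier, `toy_bertscore_f1("the automobile is red", "the car is red")` stays close to 1.0, because “automobile” and “car” sit near each other in embedding space.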
Human Evaluation Methods
Automated metrics are proxies for human judgment. When accuracy matters, human evaluation is essential.
Direct Assessment (DA)
How it works: Evaluators rate translations on a continuous scale (typically 0-100) for adequacy (does it convey the meaning?) and fluency (does it read naturally?).
Advantages: Simple, scalable, captures overall quality.
Limitations: Subjective, requires clear guidelines, and inter-annotator agreement varies.
MQM (Multidimensional Quality Metrics)
How it works: Evaluators identify and classify specific errors in translations. Each error is categorized (accuracy, fluency, terminology, style, etc.) and assigned a severity (critical, major, minor).
Advantages: Diagnostic: it tells you not just that a translation is bad but why. A standardized error taxonomy allows comparison across evaluations.
Limitations: Expensive, time-consuming, and requires trained evaluators.
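Turning annotated errors into a single number usually means weighted penalty points. The weights and normalization below (1/5/10 points per minor/major/critical error, scaled per 100 words) are common choices in MQM-style deployments but are assumptions here, not a fixed standard; real projects tune both.

```python
# Assumed severity weights; actual MQM deployments configure these.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_style_score(errors, word_count, max_score=100):
    """Score a translation from annotated errors: start at max_score
    and subtract weighted penalty points per 100 evaluated words.
    `errors` is a list of (category, severity) tuples."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    return max_score - 100 * penalty / word_count
```

For a 200-word sample with one minor terminology error and one major accuracy error, this yields 100 − 100 × 6 / 200 = 97 points, and the per-category tuples preserve the diagnostic detail that makes MQM valuable.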
Ranking-Based Evaluation
How it works: Evaluators compare two or more translations of the same source sentence and rank them from best to worst.
Advantages: Easier than absolute scoring; humans are better at relative judgments than absolute ones.
Limitations: Does not tell you how good the best option is in absolute terms, and does not scale well to many systems.
Post-Editing Effort
How it works: Measure the time, keystrokes, or edit distance required for a human to post-edit machine translation output to acceptable quality.
Advantages: Directly relevant to production cost and efficiency.
Limitations: Depends heavily on the post-editor’s skill and familiarity with the domain.
How We Evaluate at nllb.com
Our translation comparisons use a multi-metric approach:
- BLEU scores (SacreBLEU implementation) for comparability with published benchmarks.
- COMET scores (latest COMET model) as our primary automated quality indicator.
- Editorial evaluation by native speakers rating on a 1-10 scale for adequacy and fluency.
- Error annotation (MQM-inspired) for identifying systematic issues.
We report all metrics rather than relying on any single number, because no single metric captures the full picture.
Common Pitfalls in Interpreting Metrics
1. Comparing BLEU Scores Across Language Pairs
A BLEU of 35 for English-Spanish is not the same quality level as a BLEU of 35 for English-Japanese. Language characteristics dramatically affect raw scores.
2. Confusing Correlation with Reliability
COMET correlates well with human judgments on average, but can be unreliable for individual sentences or unusual content types.
3. Ignoring Confidence Intervals
A BLEU difference of 0.5 points between two systems is usually not statistically significant. Always check whether differences are meaningful.
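One standard way to check whether a score difference is meaningful is paired bootstrap resampling over segment-level scores: resample the test set with replacement many times and count how often system A outscores system B. This sketch assumes you already have per-segment scores for both systems on the same test set.

```python
import random

def bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired bootstrap: resample segments with replacement and
    return the fraction of resamples where system A's total score
    beats system B's. A win rate near 0.5 means the observed
    difference is indistinguishable from noise; a rate near 1.0
    (or 0.0) suggests a real difference."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

In practice you would feed this segment-level COMET or chrF scores; the same resampling idea underlies the significance tests reported alongside published MT benchmarks.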
4. Cherry-Picking Examples
Any system can produce impressive individual translations. Quality should be judged on large, representative test sets.
5. Using Outdated Benchmarks
Translation quality improves over time. Results from 2023 may not reflect 2026 performance. Check when benchmarks were last updated.
6. Assuming Metrics Capture Everything
No metric captures cultural appropriateness, register accuracy, or domain-specific correctness well. These require human judgment.
Practical Guidelines
For Developers
- Use SacreBLEU for reproducible BLEU scores
- Add COMET as a primary quality metric
- Implement automated quality checks in your CI/CD pipeline
- Sample and manually review translations regularly
For Buyers
- Ask providers for metric scores on your specific language pairs and content types
- Request blind human evaluation as part of any pilot
- Do not rely solely on provider-supplied benchmarks
For Researchers
- Report multiple metrics (BLEU, COMET, chrF at minimum)
- Use SacreBLEU with documented parameters for reproducibility
- Include human evaluation for any major claims
- Report confidence intervals or significance tests
Key Takeaways
- BLEU is the most widely cited metric but has significant limitations — it penalizes valid alternatives and does not capture meaning preservation well.
- COMET is the current best automated metric, offering much higher correlation with human judgments, but it requires running a neural model.
- No single automated metric is sufficient. The best practice is to use multiple metrics and supplement with human evaluation.
- Human evaluation remains the gold standard but is expensive and subjective. MQM provides the most diagnostic feedback.
- When comparing systems, ensure metrics are computed consistently (same implementation, same test set, same references) and check for statistical significance.
Next Steps
- Calculate BLEU scores: Use our BLEU Score Calculator: Test Your Translation Quality to test your translations.
- See metrics in action: Check the Translation Accuracy Leaderboard by Language Pair for comparative scores across systems.
- Compare systems yourself: Use the Translation AI Playground: Compare Models Side-by-Side to run side-by-side comparisons.
- Learn how AI translation works: Read How AI Translation Works: Neural Machine Translation Explained for the technical foundation.
- Apply metrics to your evaluation: Use our framework in Enterprise Translation: How to Evaluate AI Translation Providers for systematic provider assessment.