Translation Accuracy Leaderboard by Language Pair

Updated 2026-03-10

Data Notice: Figures, rates, and statistics cited in this article are based on the most recent available data at time of writing and may reflect projections or prior-year figures. Always verify current numbers with official sources before making financial, medical, or educational decisions.

Which translation AI is most accurate for your language pair? Our leaderboard ranks Google Translate, DeepL, GPT-4, Claude, and NLLB-200 across 50+ language pairs using multiple quality metrics.

Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.

How We Score

Each system is evaluated using three metrics (covered in depth in Translation Quality Metrics: BLEU, COMET, and Human Evaluation Explained):

  • BLEU Score: Automated n-gram overlap with reference translations (SacreBLEU implementation)
  • COMET Score: Neural-based quality estimation correlating with human judgment
  • Editorial Rating: Human evaluation by native speakers on a 1-10 scale

Scores are updated quarterly using standardized test sets.
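To make the first metric concrete, here is a minimal sketch of clipped n-gram precision, the core quantity behind BLEU. This is an illustration only, not the SacreBLEU implementation used for our scores (SacreBLEU adds standardized tokenization, smoothing, and a brevity penalty, and aggregates over a whole corpus); the function names and example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision between a candidate and a reference.

    Counts how many hypothesis n-grams also appear in the reference,
    clipping each n-gram's count at its count in the reference so that
    repeating a matching word cannot inflate the score.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return overlap / total if total else 0.0

# Hypothetical machine output vs. a human reference translation:
hyp = "the cat sat on the mat"
ref = "the cat is on the mat"
print(ngram_precision(hyp, ref, 1))  # unigram precision, ~0.833
print(ngram_precision(hyp, ref, 2))  # bigram precision, 0.6
```

BLEU combines these precisions for n = 1..4 (geometric mean) and multiplies by a brevity penalty, which is why a single n-gram precision is only a rough proxy for the reported scores.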

Overall Rankings (Averaged Across All Tested Pairs)

| Rank | System | Avg BLEU | Avg COMET | Avg Editorial | Best For |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 | 37.2 | 0.861 | 8.2 | Asian languages, nuanced content |
| 2 | DeepL | 38.1 | 0.865 | 8.4 | European languages (limited set) |
| 3 | Google Translate | 36.5 | 0.853 | 7.9 | Broad coverage, speed |
| 4 | Claude | 36.1 | 0.856 | 8.0 | Long-form, consistency |
| 5 | NLLB-200 | 33.4 | 0.836 | 7.3 | Low-resource languages |

Note: DeepL’s average is inflated by its focus on high-performing European pairs. GPT-4 leads when measured across all language pairs, including Asian languages.

See also: Google Translate vs DeepL vs AI Models: Which Is Most Accurate?
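The averaging effect described in the note can be reproduced directly from the editorial ratings in the per-pair tables below (all numbers are taken from this article; only the helper function is ours):

```python
# Editorial ratings (1-10 scale) for EN -> X pairs, from this article's tables.
scores = {
    "DeepL": {"ES": 8.7, "FR": 8.9, "DE": 8.8, "PT": 8.6, "IT": 8.7,   # European
              "ZH": 7.5, "JA": 7.8, "KO": 7.6,                          # Asian
              "AR": 6.8, "HI": 6.9, "RU": 8.1},                         # other major
    "GPT-4": {"ES": 8.5, "FR": 8.6, "DE": 8.3, "PT": 8.4, "IT": 8.3,
              "ZH": 8.1, "JA": 8.2, "KO": 8.0,
              "AR": 7.5, "HI": 7.7, "RU": 8.0},
}

def avg(system, pairs=None):
    """Mean editorial rating over a subset of pairs (or all pairs)."""
    vals = [v for k, v in scores[system].items() if pairs is None or k in pairs]
    return round(sum(vals) / len(vals), 2)

european = {"ES", "FR", "DE", "PT", "IT"}
print(avg("DeepL", european), avg("GPT-4", european))  # 8.74 vs 8.42: DeepL ahead
print(avg("DeepL"), avg("GPT-4"))                      # 8.04 vs 8.15: GPT-4 ahead
```

Restricting the average to European pairs puts DeepL on top; averaging over all eleven EN → X pairs tabulated here flips the order in GPT-4’s favor.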

Rankings by Language Pair

Tier 1: European High-Resource

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| EN → ES | DeepL (8.7) | GPT-4 (8.5) | Claude (8.4) | Google (8.2) | NLLB (7.6) |
| EN → FR | DeepL (8.9) | GPT-4 (8.6) | Claude (8.5) | Google (8.3) | NLLB (7.7) |
| EN → DE | DeepL (8.8) | GPT-4 (8.3) | Claude (8.1) | Google (7.9) | NLLB (7.2) |
| EN → PT | DeepL (8.6) | GPT-4 (8.4) | Claude (8.3) | Google (8.1) | NLLB (7.5) |
| EN → IT | DeepL (8.7) | GPT-4 (8.3) | Claude (8.2) | Google (8.0) | NLLB (7.4) |

See also: English to Spanish · English to French · English to German · English to Portuguese AI translation comparisons.

Tier 2: Asian High-Resource

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| EN → ZH | GPT-4 (8.1) | Claude (7.9) | Google (7.8) | DeepL (7.5) | NLLB (7.0) |
| EN → JA | GPT-4 (8.2) | Claude (7.9) | DeepL (7.8) | Google (7.5) | NLLB (6.9) |
| EN → KO | GPT-4 (8.0) | Claude (7.8) | DeepL (7.6) | Google (7.4) | NLLB (6.8) |

See also: English to Chinese (Simplified) · English to Japanese · English to Korean AI translation comparisons.

Tier 3: Other Major Languages

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| EN → AR | GPT-4 (7.5) | Claude (7.3) | Google (7.2) | DeepL (6.8) | NLLB (6.7) |
| EN → HI | GPT-4 (7.7) | Claude (7.4) | Google (7.3) | DeepL (6.9) | NLLB (6.8) |
| EN → RU | DeepL (8.1) | GPT-4 (8.0) | Claude (7.8) | Google (7.7) | NLLB (7.2) |

See also: English to Arabic · English to Hindi · English to Russian AI translation comparisons.

Reverse Pairs (X → EN)

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| ES → EN | DeepL (8.9) | GPT-4 (8.8) | Claude (8.6) | Google (8.5) | NLLB (7.9) |
| FR → EN | DeepL (9.0) | GPT-4 (8.8) | Claude (8.7) | Google (8.5) | NLLB (7.8) |
| ZH → EN | GPT-4 (8.4) | Claude (8.1) | Google (8.0) | DeepL (7.7) | NLLB (7.2) |
| JA → EN | GPT-4 (8.5) | Claude (8.2) | DeepL (8.1) | Google (7.8) | NLLB (7.0) |
| DE → EN | DeepL (9.0) | GPT-4 (8.7) | Claude (8.5) | Google (8.3) | NLLB (7.6) |

See also: Spanish to English · French to English · Chinese to English · Japanese to English · German to English AI translation comparisons.

Low-Resource Languages

| Language Pair | #1 | #2 | #3 |
| --- | --- | --- | --- |
| EN → Yoruba | NLLB (6.5) | Google (5.8) | GPT-4 (5.5) |
| EN → Igbo | NLLB (6.2) | Google (5.5) | GPT-4 (5.2) |
| EN → Swahili | Google (7.0) | NLLB (6.8) | GPT-4 (6.5) |

See also: Best Translation AI for Rare/Low-Resource Languages · Low-Resource Languages: How NLLB and Aya Are Closing the Gap.

Methodology

  • Test sets: 1,000 sentences per language pair from diverse domains (news, conversation, technical, literary)
  • Reference translations: Professional human translations
  • Update frequency: Quarterly
  • Systems tested: Latest publicly available versions
  • BLEU: SacreBLEU with default tokenization
  • COMET: Latest COMET-22 model
  • Editorial: 3 native-speaker evaluators per language, scores averaged
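The editorial step above (three native-speaker scores averaged, then systems ordered by the result) can be sketched as follows. The raw evaluator scores here are invented for illustration; only the procedure mirrors the methodology.

```python
from statistics import mean

def rank_systems(ratings):
    """Order systems by editorial rating, highest first, as in the per-pair tables."""
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores: three evaluators per system for one language pair.
raw = {
    "DeepL":  [8.6, 8.8, 8.7],
    "GPT-4":  [8.4, 8.5, 8.6],
    "Google": [8.1, 8.2, 8.3],
}

# Average the evaluators, rounding to one decimal as in the published tables.
averaged = {system: round(mean(s), 1) for system, s in raw.items()}

for rank, (system, score) in enumerate(rank_systems(averaged), start=1):
    print(f"#{rank} {system} ({score})")
# #1 DeepL (8.7)
# #2 GPT-4 (8.5)
# #3 Google (8.2)
```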

Key Takeaways

  • DeepL leads for European languages. GPT-4 leads for Asian languages and when averaged across all pairs.
  • Translation into English is consistently higher quality than translation from English, across all systems.
  • NLLB-200 leads for low-resource languages where other systems have weak or no coverage.
  • The quality gap between the top systems is smaller than most people expect — usually 0.5-1.5 points on our 10-point scale.

Next Steps