Translation Accuracy Leaderboard by Language Pair

Updated 2026-03-10

Data Notice: Figures, rates, and statistics cited in this article are based on the most recent available data at time of writing and may reflect projections or prior-year figures. Always verify current numbers with official sources before making financial, medical, or educational decisions.

Which translation AI is most accurate for your language pair? Our leaderboard ranks Google Translate, DeepL, GPT-4, Claude, and NLLB-200 across 50+ language pairs using multiple quality metrics.

Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.

How We Score

Each system is evaluated using three metrics (covered in depth in Translation Quality Metrics: BLEU, COMET, and Human Evaluation Explained):

  • BLEU Score: Automated n-gram overlap with reference translations (SacreBLEU implementation)
  • COMET Score: Neural-based quality estimation correlating with human judgment
  • Editorial Rating: Human evaluation by native speakers on a 1-10 scale

Scores are updated quarterly using standardized test sets.
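To make the first metric concrete, here is a minimal sketch of clipped n-gram precision, the core quantity behind BLEU. This is an illustration only, not the SacreBLEU implementation used for our scores (SacreBLEU adds standardized tokenization, smoothing, and a brevity penalty, and aggregates over a whole corpus); the function names and example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision between a candidate and a reference.

    Counts how many hypothesis n-grams also appear in the reference,
    clipping each n-gram's count at its count in the reference so that
    repeating a matching word cannot inflate the score.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return overlap / total if total else 0.0

# Hypothetical machine output vs. a human reference translation:
hyp = "the cat sat on the mat"
ref = "the cat is on the mat"
print(ngram_precision(hyp, ref, 1))  # unigram precision, ~0.833
print(ngram_precision(hyp, ref, 2))  # bigram precision, 0.6
```

BLEU combines these precisions for n = 1..4 (geometric mean) and multiplies by a brevity penalty, which is why a single n-gram precision is only a rough proxy for the reported scores.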

Overall Rankings (Averaged Across All Tested Pairs)

| Rank | System | Avg BLEU | Avg COMET | Avg Editorial | Best For |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 | 37.2 | 0.861 | 8.2 | Asian languages, nuanced content |
| 2 | DeepL | 38.1 | 0.865 | 8.4 | European languages (limited set) |
| 3 | Google Translate | 36.5 | 0.853 | 7.9 | Broad coverage, speed |
| 4 | Claude | 36.1 | 0.856 | 8.0 | Long-form, consistency |
| 5 | NLLB-200 | 33.4 | 0.836 | 7.3 | Low-resource languages |

Note: DeepL’s average is inflated by its focus on high-performing European pairs. GPT-4 leads when measured across all language pairs, including Asian languages.

See also: Google Translate vs DeepL vs AI Models: Which Is Most Accurate?
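The averaging effect described in the note can be reproduced directly from the editorial ratings in the per-pair tables below (all numbers are taken from this article; only the helper function is ours):

```python
# Editorial ratings (1-10 scale) for EN -> X pairs, from this article's tables.
scores = {
    "DeepL": {"ES": 8.7, "FR": 8.9, "DE": 8.8, "PT": 8.6, "IT": 8.7,   # European
              "ZH": 7.5, "JA": 7.8, "KO": 7.6,                          # Asian
              "AR": 6.8, "HI": 6.9, "RU": 8.1},                         # other major
    "GPT-4": {"ES": 8.5, "FR": 8.6, "DE": 8.3, "PT": 8.4, "IT": 8.3,
              "ZH": 8.1, "JA": 8.2, "KO": 8.0,
              "AR": 7.5, "HI": 7.7, "RU": 8.0},
}

def avg(system, pairs=None):
    """Mean editorial rating over a subset of pairs (or all pairs)."""
    vals = [v for k, v in scores[system].items() if pairs is None or k in pairs]
    return round(sum(vals) / len(vals), 2)

european = {"ES", "FR", "DE", "PT", "IT"}
print(avg("DeepL", european), avg("GPT-4", european))  # 8.74 vs 8.42: DeepL ahead
print(avg("DeepL"), avg("GPT-4"))                      # 8.04 vs 8.15: GPT-4 ahead
```

Restricting the average to European pairs puts DeepL on top; averaging over all eleven EN → X pairs tabulated here flips the order in GPT-4’s favor.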

Rankings by Language Pair

Tier 1: European High-Resource

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| EN → ES | DeepL (8.7) | GPT-4 (8.5) | Claude (8.4) | Google (8.2) | NLLB (7.6) |
| EN → FR | DeepL (8.9) | GPT-4 (8.6) | Claude (8.5) | Google (8.3) | NLLB (7.7) |
| EN → DE | DeepL (8.8) | GPT-4 (8.3) | Claude (8.1) | Google (7.9) | NLLB (7.2) |
| EN → PT | DeepL (8.6) | GPT-4 (8.4) | Claude (8.3) | Google (8.1) | NLLB (7.5) |
| EN → IT | DeepL (8.7) | GPT-4 (8.3) | Claude (8.2) | Google (8.0) | NLLB (7.4) |

See also: English to Spanish · English to French · English to German · English to Portuguese AI translation comparisons.

Tier 2: Asian High-Resource

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| EN → ZH | GPT-4 (8.1) | Claude (7.9) | Google (7.8) | DeepL (7.5) | NLLB (7.0) |
| EN → JA | GPT-4 (8.2) | Claude (7.9) | DeepL (7.8) | Google (7.5) | NLLB (6.9) |
| EN → KO | GPT-4 (8.0) | Claude (7.8) | DeepL (7.6) | Google (7.4) | NLLB (6.8) |

See also: English to Chinese (Simplified) · English to Japanese · English to Korean AI translation comparisons.

Tier 3: Other Major Languages

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| EN → AR | GPT-4 (7.5) | Claude (7.3) | Google (7.2) | DeepL (6.8) | NLLB (6.7) |
| EN → HI | GPT-4 (7.7) | Claude (7.4) | Google (7.3) | DeepL (6.9) | NLLB (6.8) |
| EN → RU | DeepL (8.1) | GPT-4 (8.0) | Claude (7.8) | Google (7.7) | NLLB (7.2) |

See also: English to Arabic · English to Hindi · English to Russian AI translation comparisons.

Reverse Pairs (X → EN)

| Language Pair | #1 | #2 | #3 | #4 | #5 |
| --- | --- | --- | --- | --- | --- |
| ES → EN | DeepL (8.9) | GPT-4 (8.8) | Claude (8.6) | Google (8.5) | NLLB (7.9) |
| FR → EN | DeepL (9.0) | GPT-4 (8.8) | Claude (8.7) | Google (8.5) | NLLB (7.8) |
| ZH → EN | GPT-4 (8.4) | Claude (8.1) | Google (8.0) | DeepL (7.7) | NLLB (7.2) |
| JA → EN | GPT-4 (8.5) | Claude (8.2) | DeepL (8.1) | Google (7.8) | NLLB (7.0) |
| DE → EN | DeepL (9.0) | GPT-4 (8.7) | Claude (8.5) | Google (8.3) | NLLB (7.6) |

See also: Spanish to English · French to English · Chinese to English · Japanese to English · German to English AI translation comparisons.

Low-Resource Languages

| Language Pair | #1 | #2 | #3 |
| --- | --- | --- | --- |
| EN → Yoruba | NLLB (6.5) | Google (5.8) | GPT-4 (5.5) |
| EN → Igbo | NLLB (6.2) | Google (5.5) | GPT-4 (5.2) |
| EN → Swahili | Google (7.0) | NLLB (6.8) | GPT-4 (6.5) |

See also: Best Translation AI for Rare/Low-Resource Languages · Low-Resource Languages: How NLLB and Aya Are Closing the Gap.

Methodology

  • Test sets: 1,000 sentences per language pair from diverse domains (news, conversation, technical, literary)
  • Reference translations: Professional human translations
  • Update frequency: Quarterly
  • Systems tested: Latest publicly available versions
  • BLEU: SacreBLEU with default tokenization
  • COMET: Latest COMET-22 model
  • Editorial: 3 native-speaker evaluators per language, scores averaged
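The editorial step above (three native-speaker scores averaged, then systems ordered by the result) can be sketched as follows. The raw evaluator scores here are invented for illustration; only the procedure mirrors the methodology.

```python
from statistics import mean

def rank_systems(ratings):
    """Order systems by editorial rating, highest first, as in the per-pair tables."""
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores: three evaluators per system for one language pair.
raw = {
    "DeepL":  [8.6, 8.8, 8.7],
    "GPT-4":  [8.4, 8.5, 8.6],
    "Google": [8.1, 8.2, 8.3],
}

# Average the evaluators, rounding to one decimal as in the published tables.
averaged = {system: round(mean(s), 1) for system, s in raw.items()}

for rank, (system, score) in enumerate(rank_systems(averaged), start=1):
    print(f"#{rank} {system} ({score})")
# #1 DeepL (8.7)
# #2 GPT-4 (8.5)
# #3 Google (8.2)
```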

Key Takeaways

  • DeepL leads for European languages. GPT-4 leads for Asian languages and when averaged across all pairs.
  • Translation into English is consistently higher quality than translation from English, across all systems.
  • NLLB-200 leads for low-resource languages where other systems have weak or no coverage.
  • The quality gap between the top systems is smaller than most people expect — usually 0.5-1.5 points on our 10-point scale.

Next Steps