Low-Resource Languages: How NLLB and Aya Are Closing the Gap
Of the world’s approximately 7,000 languages, commercial translation services like Google Translate and DeepL adequately serve perhaps 30-50. Billions of people speak languages that AI translation handles poorly or ignores entirely.
Two major projects are working to change this: Meta’s NLLB (No Language Left Behind) and Cohere for AI’s Aya initiative. This article examines what they have accomplished, how they work, and how far there is still to go.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
The Low-Resource Problem
A language is considered “low-resource” in the context of machine translation when there is insufficient parallel text (translated sentence pairs) to train a high-quality translation model. The threshold is fuzzy, but roughly:
- High-resource: 10M+ parallel sentences (English, French, Spanish, German, Chinese)
- Medium-resource: 1M-10M parallel sentences (Korean, Thai, Vietnamese, Swahili)
- Low-resource: 100K-1M parallel sentences (Yoruba, Igbo, Nepali, Khmer)
- Very low-resource: Under 100K parallel sentences (most indigenous languages, many African and Asian languages)
The consequences of being low-resource are severe: speakers of these languages are excluded from the information economy, cannot access services in their language, and face barriers in education, healthcare, and governance.
NLLB-200: No Language Left Behind
Overview
NLLB-200 is Meta’s open-source translation model released in 2022, with ongoing improvements. It supports over 200 languages, making it the widest-coverage open translation model available.
Key specs:
- Languages: 200+ (including many with fewer than 1 million speakers)
- Model sizes: 600M, 1.3B, 3.3B parameters
- Architecture: Encoder-decoder transformer (based on M2M-100)
- License: CC-BY-NC 4.0 (research use) / MIT (code)
- Training data: CCMatrix, CCAligned, OPUS, WikiMatrix, plus newly mined parallel data
How NLLB Works
NLLB uses several techniques to achieve broad language coverage:
1. Massively multilingual training: Rather than building separate models for each language pair, NLLB trains a single model on all 200+ languages simultaneously. This allows knowledge transfer — patterns learned from high-resource languages help improve translation for related low-resource languages.
2. Automated parallel data mining: NLLB’s team developed tools (LASER3, stopes) to automatically find parallel sentences across the web. By comparing sentence embeddings across languages, they identified translation pairs in web-crawled data that were previously undiscovered.
3. Language-specific data auditing: For each language, the team verified that training data was actually in the claimed language (a common problem with web-crawled data) and filtered out noise and misaligned pairs.
4. Spill-over prevention: In massively multilingual models, high-resource languages can dominate, degrading performance on low-resource languages. NLLB uses temperature-based sampling to balance training across languages.
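The temperature-based sampling in point 4 can be sketched in a few lines of Python. The language codes and sentence counts below are made-up illustrations, not NLLB's actual data mix: each language is sampled with probability proportional to its share of the data raised to the power 1/T, so higher temperatures flatten the distribution toward uniform and give low-resource languages more training exposure.

```python
# Temperature-based sampling: P(language l) ∝ (n_l / N) ** (1 / T).
# T = 1 reproduces the raw data proportions; larger T flattens the
# distribution, so low-resource languages are seen more often.
def sampling_probs(sentence_counts, temperature):
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** (1.0 / temperature)
               for lang, n in sentence_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Illustrative counts only (not real corpus sizes).
counts = {"fra_Latn": 10_000_000, "swh_Latn": 1_000_000, "yor_Latn": 100_000}

raw = sampling_probs(counts, temperature=1.0)        # ~0.901 / 0.090 / 0.009
flattened = sampling_probs(counts, temperature=5.0)  # much closer to uniform
```

With T = 1, Yoruba would be sampled less than 1% of the time; at T = 5 its share rises substantially, at the cost of slightly less exposure for French.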
NLLB Performance by Language Tier
| Tier | Example Languages | BLEU (EN→X) | Quality Assessment |
|---|---|---|---|
| High-resource | Spanish, French, German | 35-42 | Good but below DeepL/Google |
| Medium-resource | Swahili, Vietnamese, Ukrainian | 25-35 | Competitive with Google |
| Low-resource | Yoruba, Igbo, Lao | 15-25 | Best available option |
| Very low-resource | Twi, Mossi, Luganda | 10-18 | Functional but limited |
For high-resource languages, NLLB is behind commercial systems — but that is not its purpose. Its value is in the long tail of languages where no commercial system provides adequate coverage.
Practical Use of NLLB
NLLB is open-source and can be deployed locally:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
# src_lang tells the tokenizer the input language (FLORES-200 codes)
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang="eng_Latn")

inputs = tokenizer("How are you today?", return_tensors="pt")
# Force the decoder to start in the target language (here Yoruba)
tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("yor_Latn"))
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```

The model can be run on a single GPU for the smaller variants, or on CPU with reduced throughput.
Aya: Multilingual LLM Approach
Overview
Aya is Cohere for AI’s open-science initiative to build multilingual language models. Unlike NLLB, which is a dedicated translation model, Aya is a general-purpose multilingual LLM that can perform translation alongside other tasks.
Key specs:
- Languages: 101 (Aya 101), 23 (Aya 23 and Aya Expanse)
- Architecture: Decoder-only transformer
- Training data: Aya Dataset (human-curated multilingual instruction data) + Aya Collection (automated multilingual data)
- Key innovation: Community-driven data collection involving 3,000+ contributors from 119 countries
How Aya Differs from NLLB
| Aspect | NLLB-200 | Aya |
|---|---|---|
| Primary purpose | Translation | General multilingual AI |
| Architecture | Encoder-decoder | Decoder-only |
| Languages | 200+ | 101 (Aya 101) |
| Translation approach | Direct translation model | Instruction-following LLM |
| Customization | Fine-tuning | Prompting + fine-tuning |
| Other capabilities | Translation only | QA, summarization, reasoning, etc. |
| Data approach | Automated mining | Community + automated |
Aya’s Strength: Contextual Translation
Because Aya is an instruction-following LLM, it can handle translation tasks that NLLB cannot:
- Translate with context: “Translate this legal term in the context of Nigerian law”
- Explain translations: “Translate this sentence and explain why you chose that word”
- Adapt register: “Translate this into informal Nigerian Pidgin”
- Handle ambiguity: “This sentence is ambiguous — provide translations for both interpretations”
For low-resource languages that Aya supports, this contextual capability can produce better translations than NLLB for complex or ambiguous content, even if NLLB’s raw translation quality is comparable on simple sentences.
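One way to see the difference in practice is in how the request is framed. The sketch below builds the kind of contextual instruction described above; the template is an illustrative assumption, not an official Aya prompt format, and the resulting string would be sent to whatever Aya serving setup you use.

```python
# Illustrative prompt builder for contextual translation requests to an
# instruction-following model such as Aya. The template is an assumption
# for illustration, not an official Aya prompt format.
def translation_prompt(text, target_lang, context=None, register=None):
    parts = [f"Translate the following text into {target_lang}."]
    if context:
        parts.append(f"Context: {context}.")
    if register:
        parts.append(f"Use a {register} register.")
    parts.append(f"Text: {text}")
    return "\n".join(parts)

prompt = translation_prompt(
    "The defendant waived his right to counsel.",
    "Yoruba",
    context="a Nigerian legal proceeding",
    register="formal",
)
```

A dedicated translation model like NLLB has no channel for the `context` or `register` lines; an instruction-following model can condition on them directly.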
Other Projects Closing the Gap
Masakhane
A grassroots research community focused on NLP for African languages. Masakhane has produced translation models, datasets, and benchmarks for dozens of African languages. Their community-driven approach ensures that language speakers are involved in data creation and evaluation.
AmericasNLP
A research workshop and community focused on NLP for indigenous languages of the Americas. They organize shared tasks for machine translation of languages like Quechua, Guarani, Aymara, and Nahuatl.
OPUS-MT / Helsinki-NLP
The University of Helsinki maintains OPUS-MT, a collection of open-source translation models covering over 1,000 language pairs. While individual model quality varies, the breadth of coverage is valuable for low-resource pairs.
Google’s 1,000-Language Initiative
Google has announced a goal of building AI models that support 1,000 languages. Their Universal Speech Model and PaLM 2 efforts have expanded language coverage, though much of this work remains proprietary.
Challenges That Remain
Data Quality vs. Quantity
For low-resource languages, the available parallel data is often noisy — misaligned sentences, incorrect language labels, and low-quality translations. Simply having more data does not help if the data is unreliable. NLLB’s data auditing efforts partially address this, but it remains a fundamental challenge.
Evaluation Difficulty
How do you know if a translation into Yoruba or Lao is good? Automated metrics like BLEU require reference translations, which are scarce for low-resource languages. Human evaluation requires native speakers with translation expertise, who may be difficult to find and compensate fairly.
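To see concretely why reference scarcity bites, here is the modified n-gram precision at the heart of BLEU in a minimal pure-Python sketch. Real evaluations should use a maintained implementation such as sacreBLEU, which combines precisions for n = 1 through 4 with a brevity penalty; the point here is only that the metric cannot be computed without a reference translation.

```python
from collections import Counter

# Modified n-gram precision: each candidate n-gram is credited at most
# as many times as it appears in the reference, so repeating a correct
# word cannot inflate the score.
def modified_precision(candidate, reference, n):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

score = modified_precision("the cat sat on the mat",
                           "the cat is on the mat", n=1)  # → 5/6
```

Without a trusted reference sentence to pass as the second argument, the computation is simply undefined — which is exactly the situation for most low-resource languages.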
Dialect and Variety
Many “languages” encompass significant dialectal variation. “Arabic” includes dozens of regional varieties. “Chinese” includes Mandarin, Cantonese, and many others. Most translation systems target the standard/written variety, leaving speakers of other varieties poorly served.
Script and Encoding Issues
Some low-resource languages use scripts with incomplete Unicode support, complex rendering requirements, or multiple orthographic conventions. These technical issues can cause problems in data processing, model training, and output rendering.
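A common instance of the encoding problem is combining characters: the same visible letter can be stored as one precomposed code point or as a base letter plus a combining mark, and the two representations compare unequal. The standard-library sketch below shows the mismatch and the usual fix, Unicode normalization, applied before any corpus processing.

```python
import unicodedata

# "é" stored two ways: precomposed (U+00E9) vs. "e" + combining acute
# accent (U+0301). They render identically but are different strings,
# which silently splits word counts and corpus statistics.
precomposed = "\u00e9"
decomposed = "e\u0301"
print(precomposed == decomposed)  # False

# Normalizing both to a single form (NFC here) removes the mismatch.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True
```

For scripts with multiple orthographic conventions the choices are harder than a one-line normalize call, but inconsistent normalization alone is enough to degrade training data.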
Sustainability
Research projects like NLLB and Aya produce models, but who maintains them? As languages evolve and new content types emerge, models need updating. Sustainable funding and community engagement are essential for long-term impact.
Ethical Concerns
There are legitimate concerns about AI systems for indigenous and minority languages:
- Who controls the data and the models?
- Are language communities consulted and compensated?
- Could translation systems be used for surveillance or cultural homogenization?
- Are errors in sensitive contexts (medical, legal) adequately communicated?
How to Use Low-Resource Translation Today
For Developers
- Start with NLLB-200 for the widest language coverage.
- Try Aya for languages it supports, especially when contextual understanding matters.
- Fall back to Google Translate for languages it covers but NLLB handles poorly.
- Always communicate quality expectations — let users know that translation quality varies by language.
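The fallback chain above can be sketched as a simple routing function. The language codes and coverage sets below are placeholders for illustration, not real support lists — check each system's documented language coverage before relying on it.

```python
# Illustrative coverage sets (placeholders, not actual support lists).
NLLB_LANGS = {"yor", "ibo", "lo", "sw", "fr"}
AYA_LANGS = {"yor", "sw", "fr"}
GOOGLE_LANGS = {"sw", "fr", "lo"}

def pick_backend(lang, needs_context=False):
    """Route a translation request per the guidance above."""
    if needs_context and lang in AYA_LANGS:
        return "aya"               # instruction-following, context-aware
    if lang in NLLB_LANGS:
        return "nllb-200"          # widest coverage
    if lang in GOOGLE_LANGS:
        return "google-translate"  # fallback for covered languages
    return None                    # surface "unsupported" to the user

pick_backend("yor")                      # → "nllb-200"
pick_backend("yor", needs_context=True)  # → "aya"
```

Returning `None` rather than a best-effort guess keeps the quality expectation explicit, per the last point above.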
For Organizations
- Identify your actual language needs — which low-resource languages do your users or customers speak?
- Test quality on representative content before deploying.
- Combine AI with human review for anything important.
- Contribute back — if you create quality translations, consider contributing them to open datasets.
For Researchers
- Contribute to data collection efforts through Masakhane, AmericasNLP, or the Aya initiative.
- Build evaluation resources — reference translations and human evaluation protocols for underserved languages.
- Focus on real-world impact — work with communities to understand their actual translation needs.
Key Takeaways
- NLLB-200 is the most comprehensive translation model for low-resource languages, covering 200+ languages with open-source availability. Its strength is breadth of coverage.
- Aya brings contextual, instruction-following capabilities to multilingual AI, covering 101 languages with the ability to handle nuanced translation tasks.
- Despite progress, translation quality for most low-resource languages remains significantly below what is available for major languages. Data scarcity is the fundamental bottleneck.
- Community-driven efforts (Masakhane, Aya contributors, AmericasNLP) are essential because they bring language expertise that no amount of engineering can replace.
- Ethical considerations — community consent, data ownership, fair compensation — must be central to low-resource language technology development.
Next Steps
- Try NLLB-200: Set it up locally with our How to Set Up NLLB-200 Locally tutorial.
- Compare NLLB with alternatives: Read NLLB-200 vs Google Translate: Accuracy by Language Pair for a detailed comparison.
- Explore the Aya model: See our Aya Model: 101-Language Translation Review for a comprehensive review.
- Find the best tool for rare languages: Check Best Translation AI for Rare/Low-Resource Languages for recommendations.
- See all language pair rankings: Visit the Translation Accuracy Leaderboard by Language Pair.