How to Set Up NLLB-200 Locally: Tutorial
NLLB-200 (No Language Left Behind) is Meta’s open-source translation model supporting 200+ languages. Running it locally gives you unlimited free translation with full data privacy. This tutorial walks you through setup from scratch.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
Prerequisites
- Python 3.9+
- GPU with CUDA (recommended): NVIDIA GPU with 8GB+ VRAM for the small model, 16GB+ for medium, 40GB+ for large
- CPU-only: Possible but slow (~5-20 seconds per sentence vs. ~0.1-0.5 seconds with GPU)
- Disk space: 2-12 GB depending on model size
- pip or conda for package management
Step 1: Choose Your Model Size
| Model | Parameters | VRAM Required | Disk Space | Quality | Speed (GPU) |
|---|---|---|---|---|---|
| nllb-200-distilled-600M | 600M | ~4 GB | ~2.3 GB | Good | ~100 ms/sentence |
| nllb-200-1.3B | 1.3B | ~8 GB | ~5 GB | Better | ~150 ms/sentence |
| nllb-200-distilled-1.3B | 1.3B | ~8 GB | ~5 GB | Better | ~120 ms/sentence |
| nllb-200-3.3B | 3.3B | ~16 GB | ~12 GB | Best | ~250 ms/sentence |
Recommendation: Start with nllb-200-distilled-600M for testing. It fits on consumer GPUs and offers good quality. Upgrade to larger models for production use.
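The VRAM column can be sanity-checked with back-of-the-envelope arithmetic: fp16 weights take about 2 bytes per parameter, and generation adds activation, KV-cache, and framework overhead on top. A minimal sketch, where the overhead factor is a ballpark assumption rather than a measured value:

```python
def estimate_vram_gb(params_millions: float, bytes_per_param: float = 2.0,
                     overhead_factor: float = 3.0) -> float:
    """Rough VRAM estimate: fp16 weights plus generation overhead.

    overhead_factor is an assumed multiplier covering activations,
    KV cache, and framework buffers; real usage varies with batch
    size and sequence length.
    """
    weights_gb = params_millions * 1e6 * bytes_per_param / 1e9
    return round(weights_gb * overhead_factor, 1)

# 600M params -> 1.2 GB of fp16 weights, ~3.6 GB with overhead
print(estimate_vram_gb(600))
```

The estimates land in the same range as the table above; treat them as a lower bound when picking hardware.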
Step 2: Install Dependencies
```shell
pip install transformers torch sentencepiece accelerate
```
For GPU support, ensure you have the correct CUDA toolkit installed for your NVIDIA driver version.
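Before downloading a multi-gigabyte model, it is worth confirming that PyTorch can actually see your GPU. A small check that degrades gracefully when torch or CUDA is missing:

```python
import importlib.util

def cuda_status() -> str:
    """Summarize the PyTorch/CUDA setup as a short string."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch installed, running on CPU"

print(cuda_status())
```

If this reports CPU only on a machine with an NVIDIA GPU, the usual cause is a torch build that does not match the installed CUDA toolkit.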
Step 3: Download and Run
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Choose model (downloads ~2.3 GB on first run)
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def translate(text, src_lang="eng_Latn", tgt_lang="spa_Latn", max_length=400):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

# Test
result = translate("Hello, how are you?", src_lang="eng_Latn", tgt_lang="fra_Latn")
print(result)  # "Bonjour, comment allez-vous ?"
```
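Note that `translate()` truncates anything longer than the model's input limit, so long documents should be split into sentence-sized chunks and translated piece by piece. A minimal regex-based splitter as a sketch (real sentence segmentation is messier; abbreviations like "Dr." will confuse it):

```python
import re

def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks of at most max_chars characters each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Translate a long document chunk by chunk, using translate() from above:
# translated = " ".join(translate(chunk) for chunk in split_into_chunks(doc))
```

A single sentence longer than `max_chars` is kept whole here; in production you would also split on commas or hard character limits.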
Step 4: Language Codes
NLLB uses FLORES-200 language codes: an ISO 639-3 language code plus an ISO 15924 script suffix. Common codes:
| Language | NLLB Code |
|---|---|
| English | eng_Latn |
| Spanish | spa_Latn |
| French | fra_Latn |
| German | deu_Latn |
| Chinese (Simplified) | zho_Hans |
| Chinese (Traditional) | zho_Hant |
| Japanese | jpn_Jpan |
| Korean | kor_Hang |
| Arabic | arb_Arab |
| Hindi | hin_Deva |
| Russian | rus_Cyrl |
| Portuguese | por_Latn |
| Yoruba | yor_Latn |
| Swahili | swh_Latn |
The full list of 200+ language codes is available in the NLLB-200 model documentation on Hugging Face.
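A small lookup helper keeps human-readable names out of the rest of your code. The mapping below is hypothetical glue code covering only the table above; extend it from the full FLORES-200 list as needed:

```python
# Maps common language names to NLLB/FLORES-200 codes (subset only)
NLLB_CODES = {
    "english": "eng_Latn",
    "spanish": "spa_Latn",
    "french": "fra_Latn",
    "german": "deu_Latn",
    "chinese (simplified)": "zho_Hans",
    "chinese (traditional)": "zho_Hant",
    "japanese": "jpn_Jpan",
    "korean": "kor_Hang",
    "arabic": "arb_Arab",
    "hindi": "hin_Deva",
    "russian": "rus_Cyrl",
    "portuguese": "por_Latn",
    "yoruba": "yor_Latn",
    "swahili": "swh_Latn",
}

def to_nllb_code(language: str) -> str:
    """Return the NLLB code for a language name; raise KeyError if unknown."""
    key = language.strip().lower()
    if key not in NLLB_CODES:
        raise KeyError(f"Unknown language: {language!r}; add it to NLLB_CODES")
    return NLLB_CODES[key]

print(to_nllb_code("French"))  # fra_Latn
```

Failing loudly on unknown names is deliberate: a silently wrong code would make the model translate into the wrong language.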
Step 5: Serve as an API
For production use, wrap the model in a REST API. Here is a minimal example using FastAPI:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumes the model, tokenizer, and translate() from Step 3 are
# loaded in this module (or imported from it).

class TranslationRequest(BaseModel):
    text: str
    source_lang: str = "eng_Latn"
    target_lang: str = "spa_Latn"

class TranslationResponse(BaseModel):
    translated_text: str
    source_lang: str
    target_lang: str

@app.post("/translate", response_model=TranslationResponse)
async def translate_endpoint(request: TranslationRequest):
    result = translate(request.text, request.source_lang, request.target_lang)
    return TranslationResponse(
        translated_text=result,
        source_lang=request.source_lang,
        target_lang=request.target_lang,
    )
```
Run with:

```shell
uvicorn api:app --host 0.0.0.0 --port 8000
```
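Once the server is up, you can exercise the endpoint with curl (assuming the API code above is saved as api.py):

```shell
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you?", "source_lang": "eng_Latn", "target_lang": "fra_Latn"}'
```

The response is a JSON object with the translated_text, source_lang, and target_lang fields defined by the response model.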
Step 6: Optimize for Production
Batch Translation
Process multiple texts in a single forward pass for higher throughput:
```python
def translate_batch(texts, src_lang, tgt_lang, max_length=400):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```
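Passing an entire corpus to translate_batch in one call will eventually exhaust GPU memory, so in practice you feed it fixed-size batches. A small generic helper (the batch size of 16 is an assumption to tune for your GPU and sentence lengths):

```python
from typing import Iterator

def batched(items: list[str], batch_size: int = 16) -> Iterator[list[str]]:
    """Yield consecutive fixed-size batches from a list of texts."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage with translate_batch() from above:
# results = []
# for batch in batched(texts, batch_size=16):
#     results.extend(translate_batch(batch, "eng_Latn", "spa_Latn"))
```

Grouping texts of similar length into the same batch also reduces wasted padding tokens and improves throughput.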
Quantization for Reduced VRAM
Use 8-bit or 4-bit quantization (via the bitsandbytes library) to run larger models on smaller GPUs:

```python
# Requires: pip install bitsandbytes
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```

Compared to fp16, 8-bit loading roughly halves VRAM usage with minimal quality loss; 4-bit (load_in_4bit=True) cuts it further at a larger quality cost.
CTranslate2 for Speed
Convert the model to CTranslate2 format for roughly 2-4x faster inference:

```shell
pip install ctranslate2
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --output_dir nllb-ct2
```
Cloud Deployment Options
| Provider | GPU Instance | Monthly Cost (estimate) | Notes |
|---|---|---|---|
| AWS | g4dn.xlarge (T4) | $150-400 | Good for 600M model |
| GCP | n1-standard-4 + T4 | $150-350 | Similar to AWS |
| Azure | NC4as T4 v3 | $150-350 | Azure ecosystem |
| Lambda Labs | A10 24GB | $200-400 | GPU cloud specialist |
| RunPod | A10 24GB | $150-300 | Flexible GPU rental |
Comparison with Commercial APIs
| Aspect | Self-Hosted NLLB | Google Translate API |
|---|---|---|
| Quality (high-resource) | Good (slightly lower) | High |
| Quality (low-resource) | Best available | Variable |
| Cost at 10M chars/month | $150-400 (fixed) | $200 |
| Cost at 100M chars/month | $150-400 (fixed) | $2,000 |
| Data privacy | Full control | Google’s infrastructure |
| Latency | 50-500ms | 50-150ms |
| Languages | 200+ | 130+ |
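The break-even point in the table falls out of simple arithmetic: a fixed monthly GPU cost versus a per-character API rate. A sketch using the ballpark figures quoted above (the implied $20 per million characters rate, and $150-400/month for self-hosting):

```python
def break_even_chars(fixed_monthly_usd: float,
                     api_rate_per_million_usd: float = 20.0) -> float:
    """Monthly character volume at which self-hosting cost matches
    a pay-per-character API."""
    return fixed_monthly_usd / api_rate_per_million_usd * 1_000_000

# At $300/month for a GPU instance, self-hosting wins above:
print(f"{break_even_chars(300):,.0f} characters/month")  # 15,000,000
```

Below that volume the commercial API is cheaper and requires no operations work; above it, self-hosting cost stays flat while API cost grows linearly.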
Key Takeaways
- NLLB-200 can be set up locally in under 30 minutes with basic Python knowledge and a compatible GPU.
- The distilled 600M model is a good starting point — it fits on consumer GPUs and offers decent quality.
- For production deployment, use FastAPI or a similar framework to serve the model as a REST API, and consider quantization and CTranslate2 for optimization.
- Self-hosting NLLB becomes cost-effective at around 10-20 million characters per month compared to commercial APIs.
- The main trade-off is infrastructure management — you are responsible for availability, scaling, and updates.
Next Steps
- Compare NLLB with alternatives: Read NLLB-200 vs Google Translate: Accuracy by Language Pair.
- See all model options: Check Best Translation AI in 2026: Complete Model Comparison.
- Learn about integration patterns: Read Translation AI for Developers: API Comparison and Integration Guide.
- Calculate cost savings: Use Translation API Pricing Calculator.
- Explore low-resource languages: See Low-Resource Languages: How NLLB and Aya Are Closing the Gap.