How to Set Up NLLB-200 Locally: Tutorial
NLLB-200 (No Language Left Behind) is Meta’s open-source translation model supporting 200+ languages. Running it locally gives you unlimited free translation with full data privacy. This tutorial walks you through setup from scratch.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
Prerequisites
- Python 3.9+
- GPU with CUDA (recommended): NVIDIA GPU with 8GB+ VRAM for the small model, 16GB+ for medium, 40GB+ for large
- CPU-only: Possible but slow (~5-20 seconds per sentence vs. ~0.1-0.5 seconds with GPU)
- Disk space: 2-12 GB depending on model size
- pip or conda for package management
Step 1: Choose Your Model Size
| Model | Parameters | VRAM Required | Disk Space | Quality | Speed (GPU) |
|---|---|---|---|---|---|
| nllb-200-distilled-600M | 600M | ~4 GB | ~2.3 GB | Good | ~100 ms/sentence |
| nllb-200-1.3B | 1.3B | ~8 GB | ~5 GB | Better | ~150 ms/sentence |
| nllb-200-distilled-1.3B | 1.3B | ~8 GB | ~5 GB | Better | ~120 ms/sentence |
| nllb-200-3.3B | 3.3B | ~16 GB | ~12 GB | Best | ~250 ms/sentence |
Recommendation: Start with nllb-200-distilled-600M for testing. It fits on consumer GPUs and offers good quality. Upgrade to larger models for production use.
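The VRAM column can be sanity-checked with back-of-the-envelope arithmetic: fp16 weights take about 2 bytes per parameter, and generation adds activation, KV-cache, and framework overhead on top. A minimal sketch, where the overhead factor is a ballpark assumption rather than a measured value:

```python
def estimate_vram_gb(params_millions: float, bytes_per_param: float = 2.0,
                     overhead_factor: float = 3.0) -> float:
    """Rough VRAM estimate: fp16 weights plus generation overhead.

    overhead_factor is an assumed multiplier covering activations,
    KV cache, and framework buffers; real usage varies with batch
    size and sequence length.
    """
    weights_gb = params_millions * 1e6 * bytes_per_param / 1e9
    return round(weights_gb * overhead_factor, 1)

# 600M params -> 1.2 GB of fp16 weights, ~3.6 GB with overhead
print(estimate_vram_gb(600))
```

The estimates land in the same range as the table above; treat them as a lower bound when picking hardware.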
Step 2: Install Dependencies
```shell
pip install transformers torch sentencepiece accelerate
```
For GPU support, ensure you have the correct CUDA toolkit installed for your NVIDIA driver version.
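Before downloading a multi-gigabyte model, it is worth confirming that PyTorch can actually see your GPU. A small check that degrades gracefully when torch or CUDA is missing:

```python
import importlib.util

def cuda_status() -> str:
    """Summarize the PyTorch/CUDA setup as a short string."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch installed, running on CPU"

print(cuda_status())
```

If this reports CPU only on a machine with an NVIDIA GPU, the usual cause is a torch build that does not match the installed CUDA toolkit.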
Step 3: Download and Run
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Choose model (downloads ~2.3 GB on first run)
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def translate(text, src_lang="eng_Latn", tgt_lang="spa_Latn", max_length=400):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

# Test
result = translate("Hello, how are you?", src_lang="eng_Latn", tgt_lang="fra_Latn")
print(result)  # "Bonjour, comment allez-vous ?"
```
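Note that `translate()` truncates anything longer than the model's input limit, so long documents should be split into sentence-sized chunks and translated piece by piece. A minimal regex-based splitter as a sketch (real sentence segmentation is messier; abbreviations like "Dr." will confuse it):

```python
import re

def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks of at most max_chars characters each."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Translate a long document chunk by chunk, using translate() from above:
# translated = " ".join(translate(chunk) for chunk in split_into_chunks(doc))
```

A single sentence longer than `max_chars` is kept whole here; in production you would also split on commas or hard character limits.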
Step 4: Language Codes
NLLB uses FLORES-200 language codes: an ISO 639-3 language code plus an ISO 15924 script suffix. Common codes:
| Language | NLLB Code |
|---|---|
| English | eng_Latn |
| Spanish | spa_Latn |
| French | fra_Latn |
| German | deu_Latn |
| Chinese (Simplified) | zho_Hans |
| Chinese (Traditional) | zho_Hant |
| Japanese | jpn_Jpan |
| Korean | kor_Hang |
| Arabic | arb_Arab |
| Hindi | hin_Deva |
| Russian | rus_Cyrl |
| Portuguese | por_Latn |
| Yoruba | yor_Latn |
| Swahili | swh_Latn |
The full list of 200+ language codes is available in the NLLB-200 model documentation on Hugging Face.
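A small lookup helper keeps human-readable names out of the rest of your code. The mapping below is hypothetical glue code covering only the table above; extend it from the full FLORES-200 list as needed:

```python
# Maps common language names to NLLB/FLORES-200 codes (subset only)
NLLB_CODES = {
    "english": "eng_Latn",
    "spanish": "spa_Latn",
    "french": "fra_Latn",
    "german": "deu_Latn",
    "chinese (simplified)": "zho_Hans",
    "chinese (traditional)": "zho_Hant",
    "japanese": "jpn_Jpan",
    "korean": "kor_Hang",
    "arabic": "arb_Arab",
    "hindi": "hin_Deva",
    "russian": "rus_Cyrl",
    "portuguese": "por_Latn",
    "yoruba": "yor_Latn",
    "swahili": "swh_Latn",
}

def to_nllb_code(language: str) -> str:
    """Return the NLLB code for a language name; raise KeyError if unknown."""
    key = language.strip().lower()
    if key not in NLLB_CODES:
        raise KeyError(f"Unknown language: {language!r}; add it to NLLB_CODES")
    return NLLB_CODES[key]

print(to_nllb_code("French"))  # fra_Latn
```

Failing loudly on unknown names is deliberate: a silently wrong code would make the model translate into the wrong language.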
Step 5: Serve as an API
For production use, wrap the model in a REST API. Here is a minimal example using FastAPI:
```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumes the model, tokenizer, and translate() from Step 3 are
# loaded in this module (or imported from it).

class TranslationRequest(BaseModel):
    text: str
    source_lang: str = "eng_Latn"
    target_lang: str = "spa_Latn"

class TranslationResponse(BaseModel):
    translated_text: str
    source_lang: str
    target_lang: str

@app.post("/translate", response_model=TranslationResponse)
async def translate_endpoint(request: TranslationRequest):
    result = translate(request.text, request.source_lang, request.target_lang)
    return TranslationResponse(
        translated_text=result,
        source_lang=request.source_lang,
        target_lang=request.target_lang,
    )
```
Run with:

```shell
uvicorn api:app --host 0.0.0.0 --port 8000
```
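Once the server is up, you can exercise the endpoint with curl (assuming the API code above is saved as api.py):

```shell
curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, how are you?", "source_lang": "eng_Latn", "target_lang": "fra_Latn"}'
```

The response is a JSON object with the translated_text, source_lang, and target_lang fields defined by the response model.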
Step 6: Optimize for Production
Batch Translation
Process multiple texts in a single forward pass for higher throughput:
```python
def translate_batch(texts, src_lang, tgt_lang, max_length=400):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
```
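Passing an entire corpus to translate_batch in one call will eventually exhaust GPU memory, so in practice you feed it fixed-size batches. A small generic helper (the batch size of 16 is an assumption to tune for your GPU and sentence lengths):

```python
from typing import Iterator

def batched(items: list[str], batch_size: int = 16) -> Iterator[list[str]]:
    """Yield consecutive fixed-size batches from a list of texts."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Usage with translate_batch() from above:
# results = []
# for batch in batched(texts, batch_size=16):
#     results.extend(translate_batch(batch, "eng_Latn", "spa_Latn"))
```

Grouping texts of similar length into the same batch also reduces wasted padding tokens and improves throughput.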
Quantization for Reduced VRAM
Use 8-bit or 4-bit quantization (via the bitsandbytes library) to run larger models on smaller GPUs:

```python
# Requires: pip install bitsandbytes
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```

Compared to fp16, 8-bit loading roughly halves VRAM usage with minimal quality loss; 4-bit (load_in_4bit=True) cuts it further at a larger quality cost.
CTranslate2 for Speed
Convert the model to CTranslate2 format for roughly 2-4x faster inference:

```shell
pip install ctranslate2
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --output_dir nllb-ct2
```
Cloud Deployment Options
| Provider | GPU Instance | Monthly Cost (estimate) | Notes |
|---|---|---|---|
| AWS | g4dn.xlarge (T4) | $150-400 | Good for 600M model |
| GCP | n1-standard-4 + T4 | $150-350 | Similar to AWS |
| Azure | NC4as T4 v3 | $150-350 | Azure ecosystem |
| Lambda Labs | A10 24GB | $200-400 | GPU cloud specialist |
| RunPod | A10 24GB | $150-300 | Flexible GPU rental |
Comparison with Commercial APIs
| Aspect | Self-Hosted NLLB | Google Translate API |
|---|---|---|
| Quality (high-resource) | Good (slightly lower) | High |
| Quality (low-resource) | Best available | Variable |
| Cost at 10M chars/month | $150-400 (fixed) | $200 |
| Cost at 100M chars/month | $150-400 (fixed) | $2,000 |
| Data privacy | Full control | Google’s infrastructure |
| Latency | 50-500ms | 50-150ms |
| Languages | 200+ | 130+ |
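The break-even point in the table falls out of simple arithmetic: a fixed monthly GPU cost versus a per-character API rate. A sketch using the ballpark figures quoted above (the implied $20 per million characters rate, and $150-400/month for self-hosting):

```python
def break_even_chars(fixed_monthly_usd: float,
                     api_rate_per_million_usd: float = 20.0) -> float:
    """Monthly character volume at which self-hosting cost matches
    a pay-per-character API."""
    return fixed_monthly_usd / api_rate_per_million_usd * 1_000_000

# At $300/month for a GPU instance, self-hosting wins above:
print(f"{break_even_chars(300):,.0f} characters/month")  # 15,000,000
```

Below that volume the commercial API is cheaper and requires no operations work; above it, self-hosting cost stays flat while API cost grows linearly.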
Key Takeaways
- NLLB-200 can be set up locally in under 30 minutes with basic Python knowledge and a compatible GPU.
- The distilled 600M model is a good starting point — it fits on consumer GPUs and offers decent quality.
- For production deployment, use FastAPI or a similar framework to serve the model as a REST API, and consider quantization and CTranslate2 for optimization.
- Self-hosting NLLB becomes cost-effective at around 10-20 million characters per month compared to commercial APIs.
- The main trade-off is infrastructure management — you are responsible for availability, scaling, and updates.
Next Steps
- Compare NLLB with alternatives: Read NLLB-200 vs Google Translate: Accuracy by Language Pair.
- See all model options: Check Best Translation AI in 2026: Complete Model Comparison.
- Learn about integration patterns: Read Translation AI for Developers: API Comparison and Integration Guide.
- Calculate cost savings: Use Translation API Pricing Calculator.
- Explore low-resource languages: See Low-Resource Languages: How NLLB and Aya Are Closing the Gap.