
How to Set Up NLLB-200 Locally: Tutorial

Updated 2026-03-10

Data Notice: Figures, rates, and statistics cited in this article are based on the most recent available data at time of writing and may reflect projections or prior-year figures. Always verify current numbers with official sources before making financial, medical, or educational decisions.


NLLB-200 (No Language Left Behind) is Meta’s open-source translation model supporting 200+ languages. Running it locally gives you unlimited translation with no per-character fees and full data privacy. This tutorial walks you through setup from scratch.

Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.

Prerequisites

  • Python 3.9+
  • GPU with CUDA (recommended): NVIDIA GPU with 8GB+ VRAM for the small model, 16GB+ for medium, 40GB+ for large
  • CPU-only: Possible but slow (~5-20 seconds per sentence vs. ~0.1-0.5 seconds with GPU)
  • Disk space: 2-12 GB depending on model size
  • pip or conda for package management
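Before installing anything, a quick preflight check can save time. The sketch below uses only the standard library; checking for `nvidia-smi` on the PATH is a cheap proxy for "an NVIDIA driver is installed", not a guarantee that PyTorch will see the GPU:

```python
import shutil
import sys

# Check the interpreter meets the Python 3.9+ prerequisite
assert sys.version_info >= (3, 9), "Python 3.9+ is required"

# nvidia-smi on PATH suggests an NVIDIA driver is present; it does not
# guarantee the CUDA build of PyTorch will detect the GPU
has_nvidia_driver = shutil.which("nvidia-smi") is not None
print("NVIDIA driver visible:", has_nvidia_driver)
```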

Step 1: Choose Your Model Size

| Model | Parameters | VRAM Required | Disk Space | Quality | Speed (GPU) |
|---|---|---|---|---|---|
| nllb-200-distilled-600M | 600M | ~4 GB | ~2.3 GB | Good | ~100 ms/sentence |
| nllb-200-1.3B | 1.3B | ~8 GB | ~5 GB | Better | ~150 ms/sentence |
| nllb-200-distilled-1.3B | 1.3B | ~8 GB | ~5 GB | Better | ~120 ms/sentence |
| nllb-200-3.3B | 3.3B | ~16 GB | ~12 GB | Best | ~250 ms/sentence |

Recommendation: Start with nllb-200-distilled-600M for testing. It fits on consumer GPUs and offers good quality. Upgrade to larger models for production use.
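The table above can be turned into a small picker as a rough rule of thumb. This is an illustrative sketch; the VRAM thresholds are the approximate figures quoted above, not official requirements:

```python
# (model name, approx. VRAM in GB) from the table above, smallest first
MODELS = [
    ("facebook/nllb-200-distilled-600M", 4),
    ("facebook/nllb-200-distilled-1.3B", 8),
    ("facebook/nllb-200-3.3B", 16),
]

def pick_model(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits."""
    chosen = None
    for name, required_gb in MODELS:
        if vram_gb >= required_gb:
            chosen = name
    if chosen is None:
        raise ValueError("~4 GB VRAM is needed for even the smallest model")
    return chosen

print(pick_model(10))  # facebook/nllb-200-distilled-1.3B
```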

Step 2: Install Dependencies

pip install transformers torch sentencepiece accelerate

For GPU support, install a PyTorch build whose CUDA version is supported by your NVIDIA driver (the pytorch.org install selector gives the correct pip command for your setup).

Step 3: Download and Run

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Choose model (downloads ~2.3 GB on first run)
model_name = "facebook/nllb-200-distilled-600M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def translate(text, src_lang="eng_Latn", tgt_lang="spa_Latn", max_length=400):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)

    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length
    )

    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

# Test
result = translate("Hello, how are you?", src_lang="eng_Latn", tgt_lang="fra_Latn")
print(result)  # "Bonjour, comment allez-vous ?"

Step 4: Language Codes

NLLB uses FLORES-200 language codes: an ISO 639-3 language code plus an ISO 15924 script suffix (e.g. eng_Latn). Common codes:

| Language | NLLB Code |
|---|---|
| English | eng_Latn |
| Spanish | spa_Latn |
| French | fra_Latn |
| German | deu_Latn |
| Chinese (Simplified) | zho_Hans |
| Chinese (Traditional) | zho_Hant |
| Japanese | jpn_Jpan |
| Korean | kor_Hang |
| Arabic | arb_Arab |
| Hindi | hin_Deva |
| Russian | rus_Cyrl |
| Portuguese | por_Latn |
| Yoruba | yor_Latn |
| Swahili | swh_Latn |

The full list of 200+ language codes is available in the NLLB-200 model documentation on Hugging Face.
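If your application speaks two-letter ISO 639-1 codes ("en", "fr"), a small mapping layer keeps FLORES-200 codes from scattering through the codebase. The subset below is illustrative; extend it from the full list:

```python
# Illustrative subset of the ISO 639-1 -> FLORES-200 mapping; extend as needed
ISO_TO_NLLB = {
    "en": "eng_Latn",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "zh": "zho_Hans",   # defaulting Chinese to Simplified here
    "ar": "arb_Arab",
}

def to_nllb_code(iso_code: str) -> str:
    """Map an ISO 639-1 code to its NLLB/FLORES-200 code."""
    try:
        return ISO_TO_NLLB[iso_code.lower()]
    except KeyError:
        raise ValueError(f"No NLLB mapping defined for {iso_code!r}")

print(to_nllb_code("fr"))  # fra_Latn
```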

Step 5: Serve as an API

For production use, wrap the model in a REST API. Here is a minimal example using FastAPI:

from fastapi import FastAPI
from pydantic import BaseModel

# Assumes the translate() function from Step 3 is defined in this file
# (or imported from wherever you placed it)
app = FastAPI()

class TranslationRequest(BaseModel):
    text: str
    source_lang: str = "eng_Latn"
    target_lang: str = "spa_Latn"

class TranslationResponse(BaseModel):
    translated_text: str
    source_lang: str
    target_lang: str

@app.post("/translate", response_model=TranslationResponse)
async def translate_endpoint(request: TranslationRequest):
    result = translate(request.text, request.source_lang, request.target_lang)
    return TranslationResponse(
        translated_text=result,
        source_lang=request.source_lang,
        target_lang=request.target_lang
    )

Run with: uvicorn api:app --host 0.0.0.0 --port 8000 (assuming the code above is saved as api.py).

Step 6: Optimize for Production

Batch Translation

Process multiple texts in a single forward pass for higher throughput:

def translate_batch(texts, src_lang, tgt_lang, max_length=400):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length
    )

    return tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
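Batch size is capped by available VRAM, so large corpora have to be fed in chunks. A minimal chunking helper (the batch size of 16 in the usage sketch is an arbitrary starting point to tune against your GPU):

```python
def chunked(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Usage sketch with the translate_batch() defined above:
# results = []
# for batch in chunked(texts, 16):
#     results.extend(translate_batch(batch, "eng_Latn", "fra_Latn"))
```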

Quantization for Reduced VRAM

Use 8-bit or 4-bit quantization to run larger models on smaller GPUs:

from transformers import BitsAndBytesConfig

# Requires: pip install bitsandbytes
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

This roughly halves VRAM usage with minimal quality loss.

CTranslate2 for Speed

Convert the model to CTranslate2 format for 2-4x faster inference:

pip install ctranslate2
ct2-transformers-converter --model facebook/nllb-200-distilled-600M --output_dir nllb-ct2

Cloud Deployment Options

| Provider | GPU Instance | Monthly Cost (estimate) | Notes |
|---|---|---|---|
| AWS | g4dn.xlarge (T4) | $150-400 | Good for 600M model |
| GCP | n1-standard-4 + T4 | $150-350 | Similar to AWS |
| Azure | NC4as T4 v3 | $150-350 | Azure ecosystem |
| Lambda Labs | A10 24GB | $200-400 | GPU cloud specialist |
| RunPod | A10 24GB | $150-300 | Flexible GPU rental |


Comparison with Commercial APIs

| Aspect | Self-Hosted NLLB | Google Translate API |
|---|---|---|
| Quality (high-resource) | Good (slightly lower) | High |
| Quality (low-resource) | Best available | Variable |
| Cost at 10M chars/month | $150-400 (fixed) | $200 |
| Cost at 100M chars/month | $150-400 (fixed) | $2,000 |
| Data privacy | Full control | Google’s infrastructure |
| Latency | 50-500 ms | 50-150 ms |
| Languages | 200+ | 130+ |

(Chart: NLLB-200 vs Google Translate, accuracy by language pair)

Key Takeaways

  • NLLB-200 can be set up locally in under 30 minutes with basic Python knowledge and a compatible GPU.
  • The distilled 600M model is a good starting point — it fits on consumer GPUs and offers decent quality.
  • For production deployment, serve the model as a REST API with FastAPI or a similar framework, and consider quantization and CTranslate2 for optimization.
  • Self-hosting NLLB becomes cost-effective at around 10-20 million characters per month compared to commercial APIs.
  • The main trade-off is infrastructure management — you are responsible for availability, scaling, and updates.
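The break-even point in the cost bullet follows from the comparison table: a fixed self-hosting cost against roughly $20 per million characters for the commercial API. A back-of-envelope check (both inputs are estimates drawn from the tables above, not quoted prices):

```python
# Estimates from the tables above, not quoted prices
self_hosted_monthly_usd = 300.0     # midpoint of the $150-400 GPU range
api_usd_per_million_chars = 20.0    # ~$2,000 per 100M characters

break_even_chars = self_hosted_monthly_usd / api_usd_per_million_chars * 1_000_000
print(f"{break_even_chars:,.0f} characters/month")  # 15,000,000 characters/month
```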

Next Steps