Occitan to French: AI Translation Comparison
Occitan is a Gallo-Romance language spoken by an estimated 500,000 to 800,000 people across southern France (Occitania), with smaller communities in Spain’s Val d’Aran (where it is co-official as Aranese), Monaco, and parts of Italy’s Piedmont valleys. Once the prestige literary language of medieval Europe, the language of the troubadours, Occitan has experienced centuries of decline under French language policies, from the 1539 Ordinance of Villers-Cotterêts through to the 1994 Toubon Law. The language comprises six major dialects (Languedocien, Provençal, Gascon, Limousin, Auvergnat, and Vivaro-Alpine), each with distinct phonological and lexical features. Two competing orthographic standards exist: the classical norm (based on medieval spelling conventions) and the Mistralian norm (phonetic, used primarily in Provence).
This dialectal and orthographic fragmentation severely limits AI training data, as digital Occitan content is sparse and split across variants. Key translation challenges include Occitan’s enclitic pronoun system, subjunctive usage patterns that differ from French, and the partitive article system. Translation demand is driven by cultural preservation, education (calandretas, the Occitan-medium schools), regional government initiatives, literary heritage digitization, and the growing movement for official recognition.
This comparison evaluates five leading AI translation systems on Occitan-to-French accuracy, naturalness, and suitability for different use cases.
Translation comparisons are based on automated metrics and editorial evaluation. Quality varies by language pair and content type.
Accuracy Comparison Table
| System | BLEU Score | COMET Score | Editorial Rating (1-10) | Best For |
|---|---|---|---|---|
| Google Translate | 20.8 | 0.729 | 5.1 | Basic comprehension, short texts |
| DeepL | 16.5 | 0.694 | 4.3 | Very limited, not recommended as primary |
| GPT-4 | 25.7 | 0.771 | 6.3 | Complex content, literary texts |
| Claude | 22.9 | 0.749 | 5.6 | Formal documents, longer texts |
| NLLB-200 | 24.3 | 0.762 | 6.0 | Free, self-hosted, dedicated low-resource support |
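One quick sanity check on a table like this is whether the three measures rank the systems in the same order. The following is a minimal sketch in Python that copies the scores from the table above; the variable and function names are illustrative only and not part of any evaluation toolkit.

```python
# Scores copied from the accuracy table: (BLEU, COMET, editorial rating).
SCORES = {
    "Google Translate": (20.8, 0.729, 5.1),
    "DeepL": (16.5, 0.694, 4.3),
    "GPT-4": (25.7, 0.771, 6.3),
    "Claude": (22.9, 0.749, 5.6),
    "NLLB-200": (24.3, 0.762, 6.0),
}
METRICS = ("BLEU", "COMET", "Editorial")


def ranking(metric_index):
    """Return system names ordered from best to worst on one metric."""
    return sorted(SCORES, key=lambda name: SCORES[name][metric_index], reverse=True)


rankings = {metric: ranking(i) for i, metric in enumerate(METRICS)}

# True only if every metric produces exactly the same ordering of systems.
metrics_agree = len({tuple(order) for order in rankings.values()}) == 1

for metric, order in rankings.items():
    print(f"{metric:>9}: {' > '.join(order)}")
print("All metrics agree on the ordering:", metrics_agree)
```

For these figures all three measures give the same order (GPT-4, NLLB-200, Claude, Google Translate, DeepL), so the automated metrics and the editorial rating do not contradict each other here.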
Example Translations
Formal Business Email
Source: “Òsòsssòi senhor Bonafont, Amb aquela letra vos confirman l’acòrdi comercial que foguèt signat oficialament diluns passat entre nòstras doas entrepresas. Espèran una collaboracion frutuòsa.”
| System | Translation |
|---|---|
| Google Translate | Cher Monsieur Bonafont, Par cette lettre, nous vous confirmons l’accord commercial qui a été signé officiellement lundi dernier entre nos deux entreprises. Nous espérons une collaboration fructueuse. |
| DeepL | Cher Monsieur Bonafont, Par cette lettre nous confirmons l’accord commercial signé lundi entre nos entreprises. Nous espérons une collaboration. |
| GPT-4 | Cher Monsieur Bonafont, Par la présente, nous vous confirmons l’accord commercial qui a été officiellement signé lundi dernier entre nos deux entreprises. Nous espérons vivement une collaboration fructueuse et durable. |
| Claude | Cher Monsieur Bonafont, Par cette lettre, nous vous confirmons l’accord commercial qui a été signé officiellement lundi dernier entre nos deux entreprises. Nous espérons une collaboration fructueuse. |
| NLLB-200 | Cher Monsieur Bonafont, Par cette lettre, nous vous confirmons l’accord commercial qui a été signé officiellement lundi dernier entre nos deux entreprises. Nous espérons une collaboration fructueuse. |
Assessment: The close Romance relationship between Occitan and French allows most systems to produce competent formal translations. GPT-4 adds the most professional polish with “par la présente” (the standard formal letter opening), “officiellement signé” (the more natural French word order), and “vivement ... fructueuse et durable” (sincerely ... fruitful and lasting, expanding the warmth of “frutuòsa”). DeepL produces a truncated version: it drops “vos” (to you), “oficialament” (officially), and “doas” (two), shortens “diluns passat” (last Monday) to just “lundi”, and omits “frutuòsa” (fruitful) altogether. The Occitan-French cognate density is very high in formal registers, making this domain the easiest for all systems.
Casual Conversation
Source: “Adiu, cossí vas? Ièr soi anat a la platja, l’aiga èra bèla coma tot. Deman i torni. Te’n vòles venir amb ieu?”
| System | Translation |
|---|---|
| Google Translate | Salut, comment ça va? Hier je suis allé à la plage, l’eau était belle comme tout. Demain j’y retourne. Tu veux venir avec moi? |
| DeepL | Bonjour, comment allez-vous? Hier je suis allé à la plage, l’eau était belle. Demain j’y retourne. Voulez-vous venir? |
| GPT-4 | Salut, ça va? Hier je suis allé à la plage, l’eau était vraiment magnifique. Demain j’y retourne. Ça te dit de venir avec moi? |
| Claude | Salut, comment ça va? Hier je suis allé à la plage, l’eau était belle comme tout. Demain j’y retourne. Tu veux venir avec moi? |
| NLLB-200 | Salut, comment ça va? Hier je suis allé à la plage, l’eau était belle comme tout. Demain j’y retourne. Tu veux venir avec moi? |
Assessment: GPT-4 best captures the casual southern French register with “ça va?” (the most informal greeting), “vraiment magnifique” (really gorgeous, conveying the enthusiastic “bèla coma tot” more expressively), and “ça te dit de venir” (feel like coming, a distinctly informal turn of phrase). The Occitan expression “bèla coma tot” (beautiful as anything) is an intensifier that Google, Claude, and NLLB-200 translate literally as “belle comme tout”, which works in French but is less common than GPT-4’s adaptation. DeepL misreads the register entirely, using formal “vous” for a clearly casual conversation, and it also drops the intensifier “coma tot” and “amb ieu” (with me). The Occitan greeting “Adiu” (hello/goodbye, cognate with “adieu”) is correctly rendered as the informal “Salut” by every system except DeepL.
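A register mismatch of the kind described for DeepL can be caught with a very rough pronoun check before human review. The sketch below is a heuristic under stated assumptions: the marker lists `FORMAL` and `INFORMAL` and the function `guess_register` are hypothetical names chosen for illustration, and counting pronouns is far cruder than real register analysis.

```python
import re

# Hypothetical marker lists for illustration; they cover only second-person forms.
FORMAL = {"vous", "votre", "vos"}
INFORMAL = {"tu", "te", "toi", "ton", "ta", "tes"}


def guess_register(french_text):
    """Guess 'formal', 'informal', or 'unclear' from second-person pronouns."""
    words = re.findall(r"\w+", french_text.lower())
    formal = sum(word in FORMAL for word in words)
    informal = sum(word in INFORMAL for word in words)
    if formal > informal:
        return "formal"
    if informal > formal:
        return "informal"
    return "unclear"


# Outputs quoted from the casual conversation table.
deepl = ("Bonjour, comment allez-vous? Hier je suis allé à la plage, "
         "l’eau était belle. Demain j’y retourne. Voulez-vous venir?")
gpt4 = ("Salut, ça va? Hier je suis allé à la plage, l’eau était vraiment "
        "magnifique. Demain j’y retourne. Ça te dit de venir avec moi?")

print("DeepL:", guess_register(deepl))
print("GPT-4:", guess_register(gpt4))
```

Since the source uses the informal “vas” and “te’n vòles”, an output flagged as formal is a candidate for correction. The check cannot say anything about sentences that contain no second-person forms.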
Technical Content
Source: “Lo sistèma d’energia renovelabla utiliza aerogeneradors marins e panèls solars terrestres per produire electricitat per la rede nacionala, en redusissent la dependencia dels combustibles fossils.”
| System | Translation |
|---|---|
| Google Translate | Le système d’énergie renouvelable utilise des éoliennes marines et des panneaux solaires terrestres pour produire de l’électricité pour le réseau national, en réduisant la dépendance aux combustibles fossiles. |
| DeepL | Le système d’énergie renouvelable utilise des éoliennes et des panneaux solaires pour produire de l’électricité, en réduisant la dépendance aux combustibles fossiles. |
| GPT-4 | Le système d’énergie renouvelable fait appel à des aérogénérateurs offshore et à des panneaux solaires terrestres pour produire de l’électricité à destination du réseau national, réduisant ainsi la dépendance aux combustibles fossiles. |
| Claude | Le système d’énergie renouvelable utilise des éoliennes marines et des panneaux solaires terrestres pour produire de l’électricité pour le réseau national, en réduisant la dépendance aux combustibles fossiles. |
| NLLB-200 | Le système d’énergie renouvelable utilise des éoliennes marines et des panneaux solaires terrestres pour produire de l’électricité pour le réseau national, en réduisant la dépendance aux combustibles fossiles. |
Assessment: GPT-4 uses the most formal technical French with “fait appel à” (draws upon, a more formal choice than “utilise”), “aérogénérateurs” (the direct French counterpart of the source term, where the other systems use the everyday “éoliennes”), “offshore” (standard in French energy discourse), and “à destination du réseau national” (destined for the national grid, more technically formal). DeepL drops both “marins” (marine/offshore) and “terrestres” (terrestrial), and omits “per la rede nacionala” (for the national grid) entirely. The Occitan-French cognate relationship in technical vocabulary is very strong, with most terms being nearly identical between the two languages.
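Omissions like the ones noted for DeepL tend to show up as unusually short output. The following is a minimal sketch, assuming word count relative to the median across systems is an acceptable proxy; the function name `flag_short_outputs` and the 0.85 threshold are arbitrary choices for illustration, not calibrated values.

```python
from statistics import median

# Outputs quoted from the technical content table; three systems gave the same text.
LITERAL = ("Le système d’énergie renouvelable utilise des éoliennes marines et des "
           "panneaux solaires terrestres pour produire de l’électricité pour le "
           "réseau national, en réduisant la dépendance aux combustibles fossiles.")
OUTPUTS = {
    "Google Translate": LITERAL,
    "DeepL": ("Le système d’énergie renouvelable utilise des éoliennes et des "
              "panneaux solaires pour produire de l’électricité, en réduisant la "
              "dépendance aux combustibles fossiles."),
    "GPT-4": ("Le système d’énergie renouvelable fait appel à des aérogénérateurs "
              "offshore et à des panneaux solaires terrestres pour produire de "
              "l’électricité à destination du réseau national, réduisant ainsi la "
              "dépendance aux combustibles fossiles."),
    "Claude": LITERAL,
    "NLLB-200": LITERAL,
}


def flag_short_outputs(outputs, min_ratio=0.85):
    """Return systems whose word count is below min_ratio times the median."""
    lengths = {name: len(text.split()) for name, text in outputs.items()}
    typical = median(lengths.values())
    return [name for name, count in lengths.items() if count < min_ratio * typical]


print(flag_short_outputs(OUTPUTS))
```

On this example the check flags only DeepL (22 words against a median of 28). A length check cannot tell which words were dropped, and it will miss omissions that are offset by padding elsewhere, so it works as a triage step rather than a quality measure.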
Strengths and Weaknesses
Google Translate
Strengths: Free and accessible. Handles Languedocien and Provençal reasonably well. Benefits from Romance language family knowledge.
Weaknesses: Limited register adaptation. Struggles with dialectal variation. Literal approach to idioms.
DeepL
Strengths: Clean French output for simple content.
Weaknesses: Frequently drops phrases and clauses. Very limited Occitan support. Confuses formal and informal registers. Least reliable for this pair.
GPT-4
Strengths: Best contextual understanding. Superior register adaptation. Handles dialectal variation and both orthographic norms. Culturally aware translations.
Weaknesses: Higher cost. May occasionally hallucinate content for unfamiliar dialectal forms. Slower processing.
Claude
Strengths: Consistent quality for longer documents. Reliable formal register. Good baseline accuracy.
Weaknesses: Less creative with casual and literary content. Sometimes produces generic translations. Moderate vocabulary range.
NLLB-200
Strengths: Dedicated low-resource language coverage. Free and self-hostable. Competitive quality for formal content. Handles classical orthography.
Weaknesses: No register adaptation. Literal translation approach. Limited dialectal awareness.
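For readers who want to try the self-hosted route, the following is a minimal sketch using the Hugging Face `transformers` library. It assumes `transformers` and `torch` are installed and that the `facebook/nllb-200-distilled-600M` checkpoint is an acceptable stand-in; this comparison does not state which NLLB-200 variant was evaluated. `oci_Latn` and `fra_Latn` are the FLORES-200 language codes for Occitan and French, and the helper names are illustrative.

```python
MODEL_NAME = "facebook/nllb-200-distilled-600M"  # assumed checkpoint, not necessarily the one evaluated
SRC_LANG = "oci_Latn"  # FLORES-200 code for Occitan
TGT_LANG = "fra_Latn"  # FLORES-200 code for French


def batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate(texts, batch_size=8, max_length=256):
    """Translate a list of Occitan strings to French with a locally loaded model."""
    # Imported here so the batching helper works without the heavy dependencies.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang=SRC_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    target_id = tokenizer.convert_tokens_to_ids(TGT_LANG)

    results = []
    for batch in batches(texts, batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        # Forcing the first generated token to the target code selects French output.
        generated = model.generate(
            **inputs, forced_bos_token_id=target_id, max_length=max_length
        )
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Calling `translate(["Adiu, cossí vas?"])` downloads the model on first use and returns a list with one French string. Output quality will depend on the checkpoint size, so results may differ from the figures reported in this comparison.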
Recommendations
| Use Case | Recommended System |
|---|---|
| Quick personal translation | Google Translate (free) |
| Cultural heritage and literary digitization | GPT-4 with human review |
| Regional government communications | GPT-4 or Claude |
| Education materials (calandretas) | NLLB-200 or Claude |
| Academic research on Occitan texts | GPT-4 |
| High-volume processing | NLLB-200 (self-hosted) |
| Troubadour poetry and medieval texts | GPT-4 with specialist review |
Key Takeaways
- GPT-4 leads for Occitan-to-French translation, with particular strength in handling dialectal variation and producing register-appropriate French that captures the cultural nuances of Occitan expression.
- The close Gallo-Romance kinship between Occitan and French gives all systems a higher baseline than the raw speaker count and digital resource level would suggest, but dialectal fragmentation across six major variants still creates significant inconsistency.
- NLLB-200 provides a valuable free alternative with dedicated low-resource language support, especially important for cultural preservation organizations and educational institutions operating on limited budgets.
- The two competing orthographic standards (classical and Mistralian) add a preprocessing challenge: systems generally perform better on classical norm input, which has more digital representation.
Next Steps
- Try it yourself: Compare these systems on your own text in the Translation AI Playground: Compare Models Side-by-Side.
- Check the leaderboard: Browse our full Translation Accuracy Leaderboard by Language Pair.
- Understand the metrics: Learn what BLEU and COMET scores mean in Translation Quality Metrics.
- Explore rare languages: Read Best AI Translation for Rare and Low-Resource Languages.