By Josh Fjelstul, PhD · Technical · May 15, 2025 · 6 min read

Making a Monolingual Model Bilingual with Domain Adaptation

You have an English BERT model that works well on legal text. Your corpus is bilingual. Here is how domain adaptation on a parallel EU law corpus can produce a model with strong masked language modeling performance in both languages — and why legal text makes this work better than you might expect.

Suppose you have a TinyBERT model fine-tuned on English legal text. It has learned the vocabulary, citation conventions, and argumentative structure of common law documents reasonably well. Now your corpus is half French. You need the model to work in both languages.

The obvious solution is to start over with a multilingual model — mBERT or XLM-RoBERTa — and accept the performance tradeoff that comes with spreading capacity across 100 languages when you only need two. A less obvious solution, and often a better one for this specific problem, is to adapt the English model to French through continued pretraining on a bilingual domain corpus. The result is a compact, domain-specialized model that handles both languages — and the reason it works as well as it does has everything to do with the specific nature of legal language.

The standard approach and its costs

Multilingual models are the default answer to multilingual NLP problems. They are pretrained on text from dozens or hundreds of languages simultaneously, which gives them cross-lingual representations that transfer reasonably well across tasks. For general-purpose applications, this is a sensible choice.

For specialized domains, the tradeoffs are less favorable. A model pretrained on 100 languages has allocated its representational capacity across all of them, which means any single language — and certainly any specialized register within a single language — gets a smaller share of that capacity than a monolingual model of the same size would provide. mBERT knows a great deal about French. It knows considerably less about French competition law, and the gap between its representations of "abus de position dominante" and "abuse of dominant position" reflects general multilingual pretraining rather than the precise doctrinal equivalence between them.

TinyBERT adds a second constraint. It is a distilled model, smaller and faster than full BERT, designed for deployment contexts where size and latency matter. Multilingual TinyBERT models exist but are rarer and generally less capable than their English-only counterparts. If you want a small, fast, domain-specialized model that works in two languages, the multilingual default gets you less than you might hope.

Continued pretraining on a bilingual corpus

The alternative is to take the English TinyBERT model and continue pretraining it — extending its masked language modeling training — on a bilingual corpus of English and French legal documents. The model was trained to predict masked tokens in English; it will now be trained to predict masked tokens in both English and French, on text that reflects the specific vocabulary and structure of the legal domain.
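To make the mechanics concrete, here is a minimal sketch of what continued masked language modeling pretraining looks like with the Hugging Face Transformers library. The checkpoint name and corpus path are placeholders, and the hyperparameters are illustrative rather than tuned.

```python
# A minimal sketch of continued MLM pretraining on a bilingual legal corpus.
# The checkpoint name and the corpus file path are hypothetical placeholders.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Start from the existing English legal TinyBERT model (placeholder name).
checkpoint = "my-org/tinybert-legal-en"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Bilingual domain corpus: one document per line, English and French mixed.
dataset = load_dataset("text", data_files={"train": "corpus/en_fr_legal.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Same objective as the original pretraining: mask 15% of tokens and train
# the model to recover them, now on both languages.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tinybert-legal-en-fr",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

The only thing that changes relative to ordinary domain adaptation is the composition of the training file: French documents sit alongside English ones, and the same masking objective does the rest.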

This is domain adaptation extended to cover a second language. The model's existing English legal representations provide a starting point; the continued pretraining on French text teaches the model to extend those representations into the new language. Because the continued-pretraining data is domain-specific, the French representations it learns are legal French — not the French of general web text, but the French of EU directives, Commission decisions, and ECJ judgments.

The practical requirements are modest by the standards of pretraining from scratch. A bilingual corpus of EU legal documents — the kind that is publicly available from EUR-Lex and the EU's translation archives — provides sufficient coverage of both languages in the relevant domain. The compute cost is a fraction of full pretraining. You can do it on a MacBook Pro in a few hours. The result is a model that retains its English legal representations while acquiring French legal representations trained on the same domain.
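As a rough illustration of the corpus step, the sketch below pairs English and French versions of the same documents and writes them into one mixed-language training file. The directory layout (one plain-text file per document, keyed by the same identifier in both folders) is an assumption made for illustration, not the format in which EUR-Lex distributes documents.

```python
# A sketch of assembling the bilingual training corpus from parallel EN/FR
# documents. Directory names and the per-document file layout are hypothetical.
from pathlib import Path
import random

en_dir = Path("eurlex/en")  # hypothetical: English plain-text documents
fr_dir = Path("eurlex/fr")  # hypothetical: French plain-text documents

documents = []
for en_file in sorted(en_dir.glob("*.txt")):
    fr_file = fr_dir / en_file.name  # same document identifier in both folders
    if not fr_file.exists():
        continue  # keep only documents available in both languages
    documents.append(en_file.read_text(encoding="utf-8"))
    documents.append(fr_file.read_text(encoding="utf-8"))

# Shuffle so each training batch mixes both languages rather than seeing
# long runs of one language followed by the other.
random.shuffle(documents)

Path("corpus").mkdir(exist_ok=True)
with open("corpus/en_fr_legal.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        # One document per line, whitespace collapsed, matching the "text"
        # dataset loader used in the training sketch above.
        f.write(" ".join(doc.split()) + "\n")
```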

Why legal text makes this work

The interesting question is why this approach produces results as strong as it does — why a model adapted from English to French on legal text achieves masked language modeling performance in French that is comparable to its English performance, rather than the degraded performance one might expect from adapting a monolingual model.

The answer lies in the specific linguistic relationship between English and French in the legal domain.

Legal English is substantially French-derived. The Norman Conquest deposited an enormous French vocabulary into English legal usage — "contract," "tort," "plaintiff," "defendant," "jury," "verdict," "evidence," "property," "attorney" — and that vocabulary has remained largely intact for eight centuries. The overlap is not merely etymological. Many of these terms retain similar or identical meanings in both languages in legal contexts, even where their general-language meanings have diverged.

EU legal text reinforces this overlap structurally. The EU produces all its legislation simultaneously in 24 languages, which means that English and French EU legal documents are translations of the same source texts, drafted to be legally equivalent. The vocabulary correspondences are not approximate — they are designed. "Proportionnalité" and "proportionality" mean exactly the same thing in this corpus because they were written to. The same is true for the procedural vocabulary, institutional terminology, and doctrinal concepts that appear throughout EU law.

This creates favorable conditions for bilingual domain adaptation. The model's English legal representations are not starting from scratch when it encounters French legal text. A significant share of the vocabulary is cognate or identical. The structural patterns — citation formats, section organization, argumentative sequence — are parallel by design. The model is not learning a new language so much as learning a parallel encoding of concepts it already represents in English.
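One quick way to see this overlap in practice is to check how many subword pieces the English model's existing tokenizer needs per word of French legal text. The sketch below does this with two illustrative sentences; the checkpoint name is a placeholder and the sentences are examples of my own, not drawn from the corpus.

```python
# A rough check on the vocabulary-overlap claim: tokenizer fertility
# (subword pieces per word) of the English model's tokenizer on English
# vs. French legal text. The checkpoint name is hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/tinybert-legal-en")

def fertility(text: str) -> float:
    """Average subword pieces per whitespace-separated word.
    Values close to 1.0 mean the vocabulary already covers the text well."""
    words = text.split()
    pieces = tokenizer.tokenize(text)
    return len(pieces) / len(words)

en = "The Commission found an abuse of a dominant position contrary to Article 102."
fr = "La Commission a constaté un abus de position dominante contraire à l'article 102."

print(f"EN fertility: {fertility(en):.2f}")
print(f"FR fertility: {fertility(fr):.2f}")
# If the French number is only modestly higher than the English one, the
# cognate legal vocabulary is doing real work before any adaptation happens.
```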

The result, in practice, is masked language modeling performance in French that approaches English performance on domain-specific evaluation sets — an outcome that would not generalize to, say, adapting an English model to Finnish legal text, where the linguistic overlap is minimal and the structural parallels, while real, are not reinforced by shared vocabulary.
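A simple way to check this on your own corpus is to compare masked language modeling loss on held-out English and French legal text after adaptation. The sketch below averages the loss over several random maskings per document; the adapted checkpoint name and the evaluation file paths are placeholders.

```python
# A minimal sketch of the per-language evaluation described above: MLM loss
# (and pseudo-perplexity) on held-out English vs. French legal documents.
# Checkpoint name and evaluation file paths are placeholders.
import math
from pathlib import Path

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

checkpoint = "tinybert-legal-en-fr"  # the adapted model from the training sketch
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

@torch.no_grad()
def mlm_loss(path: str, n_passes: int = 5) -> float:
    """Average MLM loss over several random maskings of each held-out document."""
    texts = [ln.strip() for ln in Path(path).read_text(encoding="utf-8").splitlines() if ln.strip()]
    losses = []
    for _ in range(n_passes):  # average over random maskings for stability
        for text in texts:
            enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
            batch = collator([{k: v[0] for k, v in enc.items()}])
            out = model(**batch)
            if torch.isfinite(out.loss):  # skip the rare case where nothing was masked
                losses.append(out.loss.item())
    return sum(losses) / len(losses)

for lang, path in [("EN", "eval/en_legal.txt"), ("FR", "eval/fr_legal.txt")]:
    loss = mlm_loss(path)
    print(f"{lang}: loss={loss:.3f}  pseudo-perplexity≈{math.exp(loss):.1f}")
```

If the French numbers sit close to the English ones on held-out legal text, the adaptation has done its job; a large gap points back at the corpus mix or the amount of continued pretraining.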

What this is and isn't good for

This approach is well-suited to organizations working specifically with English and French EU legal text — which describes a substantial share of legal tech applications targeting European institutions, member state courts applying EU law, and international organizations operating under EU frameworks. It produces a compact, fast, domain-specialized model that can be deployed without the infrastructure overhead of a large multilingual system.

It is not a general solution to multilingual NLP. The linguistic properties that make English-French legal adaptation work — the historical vocabulary overlap, the parallel translation corpus, the structural correspondence — are specific to this language pair in this domain. Adapting the same model to Arabic or Chinese legal text would require a different approach entirely, and probably a multilingual foundation model rather than a domain-adapted monolingual one.

It also requires a reasonably large bilingual domain corpus to work well. The EU's public document archives are sufficient for this purpose; a smaller organization without access to parallel legal text at scale would face a data constraint that changes the calculus.

Within those limits, it is a practical and underused technique for a common problem: you have a good English model, your corpus is bilingual, and you need to extend the model's capabilities without starting over.


Domain adaptation for specialized bilingual corpora involves decisions about corpus composition, training procedures, and evaluation design that depend heavily on the specific languages, domain, and downstream tasks involved. If you are working on a legal NLP application with multilingual requirements, let's talk.
