J·C·Fjelstul
Consulting LLC
By Josh Fjelstul, PhD · Technical · March 9, 2026 · 9 min read

Why General-Purpose Language Models Struggle with Legal Text

Legal language has structural and semantic properties that general-purpose models were not trained to handle. For high-stakes legal NLP applications, the choice between fine-tuning, domain adaptation, and prompting is a consequential engineering decision — not a default.

Legal tech teams building NLP systems face a choice that is easy to defer and expensive to get wrong. GPT-4 is capable, available, and requires no training data. A fine-tuned BERT model is cheaper, faster, and more auditable. Domain adaptation produces the best results on specialized tasks but takes weeks and requires a curated corpus. The temptation is to start with whatever is most convenient and iterate from there.

This approach produces a recognizable pattern: strong demo performance, inconsistent production results, and a post-mortem that attributes the gap to "data quality issues" or "edge cases." The actual cause is usually the domain gap — a mismatch between what the model was trained on and what legal text actually is. Understanding that gap is the prerequisite for choosing the right approach.

What legal text actually is

Legal language is not technical jargon layered on top of ordinary English. It is a distinct register with its own vocabulary, citation conventions, argumentative structure, and cross-lingual semantics — each of which creates specific challenges for models trained on general text.

Vocabulary that means something different. Legal terms carry precise meanings that diverge substantially from general usage. "Consideration" in contract law is not thoughtfulness. "Discharge" is not dismissal. "Party" is not a social event. "Holding" is not a grip. In EU law, the divergence is compounded by multilingual origin: terms like "proportionality," "subsidiarity," and "margin of appreciation" have specific doctrinal content that their ordinary-language translations do not convey. A model that has learned "proportionality" from general web text has learned the wrong thing.

Citation structure as meaning. Legal reasoning is built on authority. A judicial opinion that cites Francovich is making a claim about state liability for failure to implement EU directives — the citation is not a reference, it is an argument. A model that cannot parse citation structure cannot understand what the text is claiming. This is not a minor limitation for information extraction or reasoning tasks. It is the difference between understanding the document and processing its surface features.

Argumentative structure that encodes substance. Judicial opinions, advocate general opinions, and legal briefs follow rhetorical conventions that carry meaning structurally. In ECJ opinions, the sequence of sections — admissibility, legal framework, assessment — is not formatting. It reflects the logical structure of the court's reasoning. A model that treats all paragraphs as equivalent has discarded information that a lawyer would use automatically.

Multilingual complexity. EU legal text exists in 24 official languages, all equally authentic. Treaty provisions were negotiated across language versions; divergences between them are themselves legally significant. A model trained on English web text that encounters a German-language Commission decision citing a French-language judgment applying a concept from Spanish-language treaty text is not simply doing multilingual NLP. It is navigating a legal system that was designed to operate across languages simultaneously.

Consequences of error. Wrong answers in legal NLP are not benchmark misses. They are wrong answers about what a court held, what a statute requires, or what a contract obliges. The stakes calibrate the acceptable error rate in ways that general-purpose benchmark performance does not.

What general-purpose pretraining provides — and doesn't

BERT and its variants were pretrained on English Wikipedia and the BooksCorpus. GPT-family models were trained on larger and more varied corpora, but the composition of those corpora reflects the distribution of text on the internet — which skews heavily toward informal prose, consumer content, and general news. Legal text, particularly specialized legal text from international institutions, is a small fraction of any general pretraining corpus.

What this means in practice is that a general model's representations of legal terminology reflect general usage. When the model encounters "consideration" in a contract clause, its internal representation draws on the full distribution of contexts in which "consideration" appeared during pretraining — the vast majority of which are non-legal. The legal meaning is present, but diluted. For tasks where the legal meaning is the only relevant meaning — extracting contractual obligations, classifying argument types, identifying the operative provision in a directive — this dilution matters.

The problem is not that general models have not encountered legal text. They have. The problem is that they have encountered too little of it, in too unsystematic a way, to develop representations that reflect legal usage reliably. Increasing model size helps at the margins. It does not solve a data composition problem.
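The data-composition gap can be made concrete before any training run. The sketch below is a toy diagnostic, not a production tool: it compares unigram distributions between a legal sample and a general-text sample with Jensen-Shannon divergence. The two one-line "corpora" are invented stand-ins; a real diagnostic would compare large samples of your target documents against pretraining-like text.

```python
import math
from collections import Counter

def token_dist(text):
    """Normalized unigram distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a, b):
        return sum(pa * math.log2(pa / b[t]) for t, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented one-line stand-ins for corpus samples.
legal_sample = ("the court holds that the party failed to furnish "
                "consideration under the contract")
general_sample = ("thanks for your consideration the party last night "
                  "was fun for everyone")

gap = js_divergence(token_dist(legal_sample), token_dist(general_sample))
```

Note that the shared surface forms ("consideration", "party") do not close the gap: the distributions around them differ, which is exactly the dilution problem described above.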

Three approaches and when each is appropriate

The choice between fine-tuning a general model, domain adaptation, and prompting a frontier model is not a question of which approach is best in general. It is a question of which approach is most appropriate for a specific task, given the severity of the domain gap, the volume of data, the latency and cost requirements, and what happens when the system is wrong.

Fine-tuning a general model — taking a pretrained BERT-style model and training a task-specific head on labeled examples from your domain — is appropriate when the domain gap is moderate and the task is well-defined. Sentiment classification on legal documents, for instance, does not require a deep understanding of legal semantics; the signal is largely lexical and stylistic, and a fine-tuned general model will handle it adequately. The approach is fast, cheap, and works well for tasks where the underlying language patterns are not highly specialized. It has a ceiling: for tasks that depend on correct representation of domain-specific terminology or citation structure, fine-tuning a general model adjusts the task head while leaving the domain-impoverished representations intact.
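The pattern — frozen representations, trainable task head — can be sketched in miniature. Everything here is a toy stand-in: the "encoder" is a fixed, deterministic embedding table seeded per token, playing the role of pretrained weights that are never updated, and the labels (does a clause impose an obligation?) are invented. Only the logistic-regression head trains, which is why the ceiling described above exists: nothing about the representations changes.

```python
import math
import random
import zlib

DIM = 16

def encode(text):
    """Frozen 'encoder': a deterministic vector per token (seeded by CRC32),
    standing in for a pretrained model whose weights are never updated."""
    tokens = text.lower().split()
    vec = [0.0] * DIM
    for t in tokens:
        rng = random.Random(zlib.crc32(t.encode()))
        emb = [rng.gauss(0, 1) for _ in range(DIM)]
        vec = [v + e for v, e in zip(vec, emb)]
    n = max(len(tokens), 1)
    return [v / n for v in vec]

def train_head(examples, epochs=300, lr=0.5):
    """Train only a logistic-regression head on top of frozen features."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = encode(text)
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(text, w, b):
    x = encode(text)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Invented toy labels: does the clause impose an obligation?
train_data = [
    ("the supplier shall deliver the goods by the agreed date", 1),
    ("the buyer shall pay the invoice within thirty days", 1),
    ("this agreement is governed by the laws of belgium", 0),
    ("headings are for convenience only", 0),
]
w, b = train_head(train_data)
```

The head fits the task, but "consideration" and every other term keeps whatever representation the encoder gave it — in a real system, the general-usage one.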

Domain adaptation — continuing pretraining on a large domain corpus before fine-tuning for specific tasks — addresses the representation problem directly. The model is exposed to enough legal text, in a controlled and domain-consistent way, that its representations of legal terminology shift toward legal usage. The results on specialized tasks are meaningfully better than fine-tuning alone, particularly for extraction and classification tasks that depend on precise semantic distinctions. The costs are real: continued pretraining requires a substantial domain corpus, GPU compute, and time. For organizations building systems that will run at scale on specialized legal text over a long horizon, the investment is justified. For a proof of concept or a low-volume application, it is probably not.
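The representation shift that domain adaptation buys can be illustrated with the simplest distributional model there is: co-occurrence counts. The corpora below are invented three-sentence miniatures, and counting windows is obviously not continued pretraining — but the mechanism is the same. Before the legal corpus is added, "consideration" lives near polite-usage words; after, its contexts pull toward "contract".

```python
import math
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Context-count vector for every token, within a +/- `window` span."""
    vecs = defaultdict(Counter)
    for s in sentences:
        toks = s.lower().split()
        for i, t in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    vecs[t][toks[j]] += 1
    return vecs

def cosine(a, b):
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

general = [
    "thank you for your kind consideration of my application",
    "she showed great consideration for her kind neighbors",
    "after careful consideration we chose the kind offer",
]
legal = [
    "the contract fails for lack of consideration between the parties",
    "valid consideration requires a bargained for exchange in the contract",
    "the contract recites nominal consideration between the parties",
]

before = cooccurrence(general)            # general pretraining only
after = cooccurrence(general + legal)     # after "continued pretraining"

sim_kind_before = cosine(before["consideration"], before["kind"])
sim_contract_after = cosine(after["consideration"], after["contract"])
```

In a real domain-adaptation run the same shift happens inside a transformer's contextual representations rather than in count vectors — which is why the exposure has to be large and domain-consistent to move them reliably.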

Prompting or fine-tuning a frontier model is appropriate for flexible, lower-volume tasks where the output requires synthesis or generation — drafting, summarization, question answering over heterogeneous documents — and where latency and cost are acceptable. For legal applications specifically, this approach has failure modes worth naming. Frontier models hallucinate citations with confidence; an LLM asked to identify the legal basis for a Commission decision may return a plausible-sounding but nonexistent regulation. They apply legal standards inconsistently across documents, producing results that are difficult to audit when the standard is applied one way to one document and differently to another. And because frontier models are updated by their providers, a prompt that produces reliable results today may not produce reliable results after the next model update — a reproducibility problem that matters for any application where consistency is a requirement.

None of these failure modes is disqualifying for appropriate use cases. They are disqualifying for high-stakes extraction and classification tasks where accuracy and auditability are required.
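The citation-hallucination failure mode in particular admits a cheap guardrail: never accept a generated citation that cannot be verified against an authoritative registry. The sketch below is illustrative only — the registry is a hard-coded toy set (a real system would check EUR-Lex or an internal database), the regex covers only a narrow citation format, and "Regulation (EU) 2031/9999" is deliberately invented.

```python
import re

# Toy registry; a real system would query EUR-Lex or an internal database.
KNOWN_INSTRUMENTS = {
    "Regulation (EU) 2016/679",
    "Directive 2014/24/EU",
    "Regulation (EC) No 1/2003",
}

CITATION_PATTERN = re.compile(
    r"(?:Regulation|Directive)\s+(?:\((?:EU|EC)\)\s+)?(?:No\s+)?\d+/\d+(?:/(?:EU|EC))?"
)

def audit_citations(model_output):
    """Split citations found in generated text into verified and unverified.

    Unverified citations are potential hallucinations and should be routed
    to human review, never surfaced as authority.
    """
    verified, unverified = [], []
    for m in CITATION_PATTERN.finditer(model_output):
        cite = m.group(0)
        (verified if cite in KNOWN_INSTRUMENTS else unverified).append(cite)
    return verified, unverified

# "Regulation (EU) 2031/9999" is deliberately invented.
output = ("The decision's legal basis is Regulation (EC) No 1/2003, "
          "read together with Regulation (EU) 2031/9999.")
verified, unverified = audit_citations(output)
```

A guard like this catches fabricated instruments, but not the subtler failure of citing a real instrument for a proposition it does not support — which is why verification narrows the problem rather than solving it.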

A decision framework

The right questions, in roughly this order:

Does the task require generation? If yes, a generative model is appropriate and the question is which one and at what cost. If no — if the task is classification, extraction, span detection, or similarity — a discriminative model is almost certainly better, and the domain gap determines whether fine-tuning or domain adaptation is warranted.

How severe is the domain gap? If the task depends on correct representation of specialized terminology, citation structure, or argumentative conventions, the gap is severe and domain adaptation is worth considering. If the task is relatively insensitive to domain-specific semantics, fine-tuning a general model is probably sufficient.

What is the volume and latency requirement? High-volume, low-latency applications — processing thousands of documents continuously, or responding in real time — cannot be built on frontier model APIs at reasonable cost. They require local inference, which means a smaller, fine-tuned or domain-adapted model.

What are the consequences of error, and what does auditability require? For applications in legal research, compliance, or regulatory analysis, every output needs to be traceable to a source span in the document. Generative models cannot provide this by construction. Discriminative models can.

What is the budget and timeline for the initial system? Domain adaptation is the right long-term investment for a serious legal NLP application. It is not the right answer for a three-week proof of concept. Sequencing matters: start with fine-tuning to validate the task, move to domain adaptation when you have evidence that the task is worth the investment.
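The questions above can be sketched as an ordered decision function. The boolean keys are invented for this sketch — real decisions weigh these factors on continuous scales rather than branching on flags — but the ordering matches the framework: generation first, then domain gap and horizon, with plain fine-tuning as the default for validation-stage work.

```python
def recommend_approach(task):
    """Walk the framework's questions in order and return a recommendation.

    All keys are invented illustrations of the questions in the text:
    requires_generation, severe_domain_gap, long_horizon.
    """
    if task["requires_generation"]:
        return "frontier model (prompting or fine-tuning)"
    if task["severe_domain_gap"] and task["long_horizon"]:
        return "domain adaptation, then task fine-tuning"
    return "fine-tune a general pretrained model"

# A high-stakes, long-horizon extraction system...
contract_extraction = {
    "requires_generation": False,
    "severe_domain_gap": True,
    "long_horizon": True,
}

# ...versus the same task as a short proof of concept.
poc = {**contract_extraction, "long_horizon": False}
```

The `poc` case encodes the sequencing point: the same task with the same domain gap gets plain fine-tuning first, and graduates to domain adaptation only once it has earned the investment.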

The general principle

Legal text is one of the harder instances of a general problem: deploying NLP systems in domains where the data-generating process, vocabulary, and structural conventions differ substantially from general pretraining corpora. The same analysis applies to clinical documentation, financial filings, scientific literature, and any specialized corpus where precision matters and errors have consequences.

The question is never which model is most impressive on benchmarks. It is which approach is most appropriate for the specific task, the specific corpus, and the specific consequences of being wrong. General-purpose models are general-purpose by design. When the purpose is specific enough, that design is a limitation.


If you are evaluating approaches for a legal NLP application — or any specialized-domain document processing task — and want to think through the tradeoffs before committing to an architecture, get in touch. The NLP Model Training and Domain Adaptation services are both relevant depending on where you are in the process.
