By Josh Fjelstul, PhD · Technical · April 15, 2025 · 8 min read

Your Model Learned the Wrong Distribution

A model that performs well on your test set has demonstrated exactly one thing: it learned the training distribution well. Whether that distribution resembles the data the model will encounter in production is a separate question — and in most ML projects, it goes unasked until something goes wrong.

This is not a new observation. Distribution shift is a well-documented failure mode. But the standard response — gather more data, retrain periodically, monitor for drift — treats the symptom rather than the cause. The cause is that most ML projects never explicitly reason about where their data comes from, and therefore cannot anticipate how production data will differ from training data. Social scientists have a name for this kind of reasoning: analyzing the data-generating process. It is routine in quantitative research design. It is underused in applied ML, and its absence is expensive.

The data-generating process

The data-generating process is the set of mechanisms — institutional, behavioral, social, technical — that produce the observations in your dataset. Understanding it means asking not just what your data contains, but why it contains those things and not others.

This distinction matters because your training data is never a neutral sample of the world. It is a sample of what was recorded, preserved, accessible, and collected — shaped by decisions made long before your project began. A model trained on that sample learns the regularities of that specific collection process, not the regularities of the underlying phenomenon you care about.

In legal NLP, the data-generating process is unusually legible, which makes it a useful domain for thinking through these issues concretely.

A concrete failure: procedure as a hidden confounder

Consider a team building a classifier to identify paragraphs in EU Court of Justice opinions that contain the court's legal reasoning — the holding, as distinct from the procedural history, the parties' arguments, or the advocate general's opinion.

They train on a corpus of preliminary ruling decisions, which dominate the published ECJ case law by volume. Preliminary rulings have a distinctive structure: a referring court poses specific questions, and the ECJ answers them in a predictable sequence. The reasoning paragraphs tend to appear in consistent positions, use characteristic transitional language, and follow a relatively constrained argumentative pattern. A model trained on this corpus learns these regularities. It achieves 91% F1 on a held-out test set drawn from the same distribution.

Then the team deploys the model against a corpus that includes infringement proceedings, annulment actions under Article 263 TFEU, and appeals from the General Court. The data-generating process for these document types is different in ways that matter. Infringement proceedings involve extended factual disputes with governments; the legal reasoning is interleaved with factual findings in ways that preliminary rulings rarely are. Annulment actions often turn on procedural admissibility before reaching the merits — the structure of the opinion reflects that priority. Appeals reproduce and respond to the reasoning of the court below, creating nested attribution problems that a classifier trained on preliminary rulings has never encountered.

Performance drops. The team is surprised. They should not have been. The training corpus did not represent the data-generating process for production documents — it represented one procedural stream of a court whose published output spans several distinct procedural contexts, each with its own structural logic.

What reasoning about the data-generating process would have caught

Before training, the right questions are: what produced the documents in this corpus, and is that process representative of what the model will see in production?

For the ECJ corpus, those questions immediately surface the procedural heterogeneity. Preliminary rulings dominate by volume not because they are representative of EU legal reasoning in general, but because of how the referral mechanism works — national courts generate them continuously, while infringement proceedings and annulment actions are initiated by the Commission or member states and proceed more slowly. Volume is a product of the data-generating process, not a proxy for representativeness.

A random train/test split is particularly ill-suited to corpora with this structure. Random splits assume that observations are exchangeable — that any document is as likely to appear in training as in test. In a corpus stratified by procedure type, a random split will produce a test set with the same procedural distribution as training, which means it will not detect the failure mode at all. The model appears to generalize. It has learned to generalize within a distribution, which is not the same thing.

The appropriate evaluation design stratifies by the dimensions along which production data will differ: procedure type, time period, originating jurisdiction, language. A model that generalizes across these strata has demonstrated something meaningful. A model that generalizes only within them has demonstrated that the test set was designed by the same process that generated the training data.
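To make this concrete, here is a minimal sketch of a stratum-wise holdout using scikit-learn, in which entire procedure types are withheld from training in turn. The column names, the file path, and the train_model and evaluate helpers are placeholders for whatever pipeline you already have.

```python
# Minimal sketch: hold out entire procedural strata instead of splitting at random.
# Assumes a hypothetical DataFrame with columns "text", "label", and "procedure_type".
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

corpus = pd.read_parquet("ecj_paragraphs.parquet")  # placeholder file

X = corpus["text"].values
y = corpus["label"].values
groups = corpus["procedure_type"].values  # e.g. "preliminary_ruling", "infringement", "annulment", "appeal"

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    held_out = corpus["procedure_type"].iloc[test_idx].unique()[0]
    model = train_model(X[train_idx], y[train_idx])   # placeholder training routine
    score = evaluate(model, X[test_idx], y[test_idx]) # placeholder evaluation routine
    print(f"trained without {held_out}: score on {held_out} = {score:.2f}")
```

If the score on a held-out stratum falls well below the within-stratum score, that gap is the generalization problem a random split would have hidden.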

The same problem in retrieval systems

The data-generating process mismatch is not unique to classifiers. It appears in retrieval and recommendation systems in a form that is equally easy to miss and equally damaging in production.

A semantic search system built on a bi-encoder and cross-encoder architecture — the standard setup for legal document retrieval — is typically evaluated on a set of query-document pairs: a question and the document or passage that correctly answers it. Recall@10 measures how often the correct document appears in the top ten retrieved results. On a well-constructed evaluation set, this number can look very good.
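For orientation, recall@10 is straightforward to compute once you have a set of gold query-document pairs. The sketch below assumes a placeholder retrieve() function that wraps the bi-encoder and cross-encoder stages and returns a ranked list of document IDs.

```python
# Minimal sketch of recall@k over a set of (query, relevant_doc_id) pairs.
# retrieve() is a placeholder for the bi-encoder + cross-encoder pipeline;
# it is assumed to return a ranked list of document IDs.
def recall_at_k(eval_pairs, retrieve, k=10):
    hits = 0
    for query, relevant_doc_id in eval_pairs:
        top_k = retrieve(query, k=k)
        if relevant_doc_id in top_k:
            hits += 1
    return hits / len(eval_pairs)

# eval_pairs = [("proportionality assessment under Article 5 TEU", "doc_001"), ...]  # illustrative
# print(recall_at_k(eval_pairs, retrieve, k=10))
```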

The question is where the evaluation queries came from. If they were written by developers or annotators to cover the document corpus, their data-generating process is "someone familiar with the corpus and the system wrote a question that has a known answer." The data-generating process for production queries is "a lawyer, researcher, or analyst, who may or may not be familiar with the corpus, is looking for something they need." These are not the same process. Developers write precise, well-formed queries. Practitioners write the kinds of questions they actually have — sometimes vague, sometimes multi-part, sometimes using terminology that differs from the documents they are searching.

A retrieval system that performs well on developer-generated queries and poorly on practitioner queries has not failed at retrieval. It has been evaluated against the wrong distribution. The recall@10 score was accurate. It measured the wrong population of queries.

For EU legal document retrieval specifically, this gap has a predictable structure. Evaluation queries tend to use the formal legal terminology that appears in the documents — "proportionality assessment under Article 5 TEU," "state aid compatibility under Article 107(3) TFEU." Practitioner queries tend to be more contextual and less formally specified: "cases where the Commission found market definition too narrow," "judgments on procedural rights in infringement proceedings." The semantic distance between these query types is non-trivial, and a system not evaluated on practitioner-style queries will not reveal how it handles them.
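One way to surface this gap is to tag every evaluation query with its source and report recall separately per source rather than pooling everything into a single number. The sketch below assumes a hypothetical "source" field recorded when each query was collected, and reuses the recall_at_k helper sketched above.

```python
# Minimal sketch: report recall@10 per query source rather than pooled.
# Each record is assumed to carry a "source" tag ("developer" or "practitioner").
from collections import defaultdict

def recall_by_source(eval_records, retrieve, k=10):
    by_source = defaultdict(list)
    for record in eval_records:
        by_source[record["source"]].append((record["query"], record["relevant_doc_id"]))
    return {
        source: recall_at_k(pairs, retrieve, k=k)  # recall_at_k from the sketch above
        for source, pairs in by_source.items()
    }
```

A large spread between the two numbers is not a retrieval bug; it is a measurement of how far the evaluation queries sit from the production query distribution.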

Fine-tuning embedding models and catastrophic forgetting

A related problem arises when fine-tuning a general embedding model — a BERT-style backbone — on domain-specific documents to improve downstream retrieval or classification performance.

The standard approach is to continue pretraining the backbone on your domain corpus before adding a task-specific head. This works well when the domain corpus is representative of the documents the model will encode in production. When it is not, the fine-tuning embeds the data-generating process of the fine-tuning corpus into the model's representations.
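For reference, a minimal version of continued masked-language-model pretraining with the Hugging Face transformers and datasets libraries might look like the following. The checkpoint name, corpus file, and training settings are placeholders, not a recommended configuration.

```python
# Minimal sketch of continued masked-language-model pretraining on a domain corpus.
# Checkpoint name, corpus file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

corpus = load_dataset("text", data_files={"train": "ecj_corpus.txt"})  # placeholder corpus
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ecj-adapted-backbone", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```

Everything downstream inherits whatever distribution ecj_corpus.txt represents, which is exactly why its composition deserves scrutiny before this step, not after.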

Suppose you fine-tune a legal embedding model on a corpus of ECJ preliminary rulings — again, the most available document type by volume — and then deploy it against a broader corpus that includes General Court judgments, national court decisions citing EU law, and Commission decisions. The model's representations have been shaped by the linguistic and structural regularities of preliminary rulings. Documents from different procedural contexts, which have different stylistic and argumentative patterns, will be encoded less faithfully.

The tempting fix is to continue fine-tuning on the underrepresented document types once the problem is discovered in production. This can work, but it carries a specific risk: catastrophic forgetting. Neural networks do not learn new distributions by extending their existing representations — they adjust weights in ways that can degrade performance on the original training distribution. A model that has learned to encode preliminary rulings well may, after continued fine-tuning on Commission decisions, encode both worse than a model that was trained on the combined corpus from the start.

This is not an unsolvable problem. Techniques like elastic weight consolidation and rehearsal-based methods exist specifically to mitigate catastrophic forgetting. But they add complexity, and the cleaner solution is to reason about the data-generating process before fine-tuning begins — to ask what document types the model will encounter in production and ensure they are represented in the fine-tuning corpus from the start. Going back to fix a distributional mismatch after the fact is more expensive than anticipating it.
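The rehearsal idea in particular is simple to sketch: when you continue fine-tuning on the underrepresented document types, mix in a replayed sample of the original distribution so the model keeps seeing it. The helper and the 30% replay fraction below are illustrative, not a tuned recipe.

```python
# Minimal sketch of rehearsal: mix a replayed sample of the original training
# distribution into the continued fine-tuning data, so the model keeps seeing
# the documents it already learned to encode. The replay fraction is illustrative.
import random

def build_rehearsal_corpus(new_docs, original_docs, replay_fraction=0.3, seed=42):
    """Combine new-domain documents with a replayed sample of the original corpus."""
    rng = random.Random(seed)
    n_replay = int(len(new_docs) * replay_fraction)
    replay = rng.sample(original_docs, min(n_replay, len(original_docs)))
    combined = list(new_docs) + replay
    rng.shuffle(combined)
    return combined

# e.g. continued fine-tuning data = Commission decisions plus a replayed sample
# of the preliminary rulings the backbone was originally adapted on:
# corpus = build_rehearsal_corpus(commission_decisions, preliminary_rulings)
```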

The practical implication

Reasoning about the data-generating process does not require additional data. It requires asking, before the training corpus is assembled, what mechanisms will produce it and how those mechanisms relate to the production environment. The answer shapes every subsequent decision: what to collect, how to stratify the evaluation, where to expect degraded performance, and what monitoring is needed once the system is live.

For legal document analysis, the data-generating process is defined by institutional and procedural structure: court jurisdiction, case type, procedural stage, temporal period, language regime. These dimensions are almost always available as metadata. If the metadata are already attached to your documents, using them to design the evaluation costs nothing; if they are not, collecting them may be worth the cost. Ignoring them produces a test set that flatters your model and a deployment that surprises your organization (not in a good way).
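In practice this can be as simple as grouping the test predictions by each metadata dimension and reporting the metric per stratum. The file path and column names below are placeholders; the point is that one pooled score becomes a table of scores, one row per value of each dimension.

```python
# Minimal sketch: report performance per metadata stratum instead of one pooled score.
# Assumes a hypothetical DataFrame with binary gold labels, model predictions, and the
# institutional metadata already attached to each document.
import pandas as pd
from sklearn.metrics import f1_score

results = pd.read_parquet("test_predictions.parquet")  # placeholder file

for dimension in ["procedure_type", "language", "decision_year"]:
    print(f"\nF1 by {dimension}:")
    for value, group in results.groupby(dimension):
        score = f1_score(group["label"], group["prediction"], average="binary")
        print(f"  {value}: {score:.2f}  (n={len(group)})")
```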

The 91% F1 score was accurate. It measured the wrong thing.


If you are building an NLP system for legal document analysis and want to make sure the evaluation reflects production conditions rather than training conditions, let's talk.
