J·C·Fjelstul Consulting LLC
By Josh Fjelstul, PhD · Technical · March 15, 2025 · 9 min read

The Label Is a Hypothesis

Label design is the most consequential modeling decision in a supervised learning project. Treat it as a mechanical annotation task and you'll build a model that learns the wrong thing with high confidence.

There is a peculiar asymmetry in how ML projects allocate attention. Model architecture gets weeks of deliberation. Training procedures get carefully tuned. Evaluation metrics get debated. Labels — the annotations that determine what the model actually learns — often get an afternoon.

This is backwards. Everything downstream of the label is downstream of the decision about what to measure. A model cannot learn a concept its labels don't capture. No amount of architectural sophistication compensates for a poorly specified annotation scheme. Label quality is a ceiling, not a variable.

What a label actually is

A label is not a fact about the data. It is a claim about the relationship between something observable in the text and an underlying concept you care about.

When you ask an annotator to mark a paragraph in a court judgment as "legally significant," you are not asking them to identify an intrinsic property of that paragraph. You are asking them to apply a theoretical construct — legal significance — to an observable artifact. The construct exists in legal theory, not in the text. The annotation is a hypothesis: that this paragraph, in this document, instantiates the concept in a way that is consistent with how the construct is defined and how it will be used downstream.

Statisticians and social scientists have a name for this. Operationalization is the process of translating a theoretical concept into a measurable indicator. It has been a core methodological concern in quantitative social science for decades — because researchers learned, often painfully, that the gap between a construct and its operationalization is where validity goes to die.

ML has largely inherited this problem without inheriting the vocabulary for diagnosing it.

What goes wrong

Consider a legal tech team building a classifier to identify "aggressive litigation tactics" in EU court filings. The construct is meaningful — lawyers and judges have intuitions about what aggressive argumentation looks like, and a system that could identify it reliably would be useful for litigation analysis and judicial behavior research.

The team hires annotators, gives them a definition ("argumentative language that goes beyond standard legal advocacy"), and starts labeling. Inter-rater agreement comes back at 0.71 — respectable by most benchmarks, good enough to proceed. They train a model. It achieves 87% F1 on the held-out test set. It gets deployed.
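
For concreteness, here is what a chance-corrected agreement score like that summarizes. The sketch below uses made-up binary labels from two annotators; the kappa statistic compares the agreement they actually reached with the agreement two annotators with those label rates would reach by chance.

```python
# A minimal sketch of what a chance-corrected agreement score summarizes.
# The label arrays are made up; 1 = "aggressive", 0 = "not aggressive".
import numpy as np
from sklearn.metrics import cohen_kappa_score

annotator_a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
annotator_b = np.array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1])

observed = (annotator_a == annotator_b).mean()  # raw percent agreement

# Agreement expected by chance, given each annotator's label rates
p_pos = annotator_a.mean() * annotator_b.mean()
p_neg = (1 - annotator_a.mean()) * (1 - annotator_b.mean())
expected = p_pos + p_neg

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement: {observed:.2f}, kappa: {kappa:.2f}")
print(f"sklearn check: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```

The statistic measures how often the annotators agree, not why they agree.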

Three things have gone wrong, none of which the F1 score reveals.

First, the annotators operationalized the construct differently. One annotator weighted rhetorical intensity — forceful language, strong assertions, explicit challenges to opposing counsel. Another weighted procedural aggression — frivolous objections, delays, motions designed to impose costs rather than advance arguments. Both are defensible readings of "aggressive litigation tactics." They are not the same construct. The model learned a mixture of whatever each annotator brought to the task, which is not a coherent concept at all.

High inter-rater agreement, in this context, is not reassuring. It means the annotators converged on surface features — long paragraphs, dense citation chains, adversarial framing — that co-occur with both interpretations of the construct. The model learned those surface features. Whether it learned anything about legal aggression in a theoretically meaningful sense is a different question, and 0.71 kappa doesn't answer it.

Second, the team labeled what was easy to observe rather than what actually matters. "Aggressive" language in EU court filings has a specific texture: it tends to appear in written pleadings, often targets the legal reasoning of prior decisions rather than opposing parties directly, and is constrained by the formal conventions of European legal writing in ways that differ substantially from, say, US litigation. An annotation scheme developed without that domain knowledge will drift toward surface features that are more salient to a general reader — heated language, explicit criticism, length — than to features that reflect legal aggression as practitioners understand it.

The model that results performs well on documents that look like the training data. It performs poorly, and in systematically biased ways, on documents where the surface features and the underlying construct come apart — which in a specialized legal corpus is not rare.

Third, the model learned spurious correlations the labels introduced. EU court filings from certain jurisdictions and case types are longer, more formally structured, and more citation-dense than others. If aggressive filings happen to be overrepresented in certain procedural contexts — preliminary rulings, infringement proceedings, state aid cases — the model will learn to associate those procedural contexts with aggression, even if the relationship is incidental. Deploying the model on a different distribution of case types will produce results that look inexplicable until you reconstruct what the labels were actually measuring.
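
A check like the one sketched below, run before training, makes this kind of confounding visible. The column names (case_type, n_tokens, label) are placeholders for whatever metadata the corpus actually carries.

```python
# A sketch of a pre-training confound check: does the "aggressive" label co-vary
# with procedural context or document length? Column names (case_type, n_tokens,
# label) are placeholders for whatever metadata the corpus actually has.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("annotations.csv")  # hypothetical file: one row per labeled document

# Label prevalence by procedural context: large gaps are candidate confounds
print(df.groupby("case_type")["label"].agg(["mean", "count"]))

# Is the association between case type and label larger than chance?
chi2, p_value, _, _ = chi2_contingency(pd.crosstab(df["case_type"], df["label"]))
print(f"chi2 = {chi2:.1f}, p = {p_value:.4f}")

# Do documents labeled aggressive differ systematically in length?
print(df.groupby("label")["n_tokens"].describe())
```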

Disagreement is data

The standard response to annotator disagreement is to treat it as noise — resolve it through adjudication, drop ambiguous cases, report the final agreement score, and move on. This is a mistake.

Disagreement between annotators is validity information. When two trained annotators applying the same guidelines reach different conclusions about the same document, that disagreement is telling you something about the construct: either it is underdefined, or it is genuinely ambiguous, or the operationalization captures multiple distinct phenomena that need to be separated. Any of these is worth knowing before you train a model.
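
One way to use that information, sketched below with placeholder column names, is to pull out the documents where annotators diverge and ask whether the disagreement clusters by procedural context or document length, rather than averaging it away.

```python
# A sketch of reading disagreement as validity information rather than noise.
# Assumes a long-format table of annotations with hypothetical columns:
# doc_id, annotator, label, plus document metadata (case_type, n_tokens).
import pandas as pd

annotations = pd.read_csv("annotations_long.csv")

# One row per document, one column per annotator
wide = annotations.pivot(index="doc_id", columns="annotator", values="label")
wide["disagree"] = wide.nunique(axis=1) > 1  # annotators did not all agree

# Attach document metadata (placeholder column names)
meta = annotations.drop_duplicates("doc_id").set_index("doc_id")[["case_type", "n_tokens"]]
wide = wide.join(meta)

# Is disagreement concentrated in particular procedural contexts?
print(wide.groupby("case_type")["disagree"].mean().sort_values(ascending=False))

# Which documents most need to go back to the construct definition?
print(wide[wide["disagree"]][["case_type", "n_tokens"]].head(10))
```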

The appropriate response to systematic disagreement is not adjudication. It is to go back to the construct definition, understand why the disagreement is occurring, and revise the operationalization until the source of disagreement is resolved at the conceptual level. This takes longer than pressing forward. It also produces a model that learns what you intended it to learn.

In social science, this process has a name: construct validation. A measure has construct validity when it demonstrably captures the theoretical concept it is intended to measure — not just when annotators agree on its application. The validation process involves examining whether the measure correlates with other indicators of the same construct (convergent validity), whether it fails to correlate with indicators of different constructs (discriminant validity), and whether it predicts outcomes it theoretically should predict (predictive validity).
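
None of these checks requires new tooling. The sketch below assumes a validation sample in which each labeled document also carries an independent expert rating of aggression, a measure of legal or procedural complexity, and a downstream outcome such as a costs order; all three indicator names are hypothetical.

```python
# A sketch of the three validity checks, applied to a labeled validation sample.
# The external indicators (expert_rating, complexity_score, costs_awarded) are
# hypothetical; the point is the structure of the checks, not these columns.
import pandas as pd

df = pd.read_csv("validation_sample.csv")

# Convergent validity: the label should track an independent measure of the same construct
print("convergent: ", df["label"].corr(df["expert_rating"], method="spearman"))

# Discriminant validity: the label should not simply track a different construct,
# such as procedural or legal complexity
print("discriminant:", df["label"].corr(df["complexity_score"], method="spearman"))

# Predictive validity: the label should predict outcomes it theoretically should,
# for example whether the filing party was later ordered to pay costs
print("predictive:  ", df["label"].corr(df["costs_awarded"], method="spearman"))
```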

These tests are not standard ML practice. They should be.

A more useful annotation workflow

The practical implication is not that annotation is harder than you thought — it is that annotation should be treated as a research design problem rather than a labeling task.

Before annotation begins: define the construct with enough precision that edge cases can be resolved by reference to the definition, not by individual judgment. For "aggressive litigation tactics" in EU law, this means specifying what counts as aggression in the context of ECJ procedure, how it differs from vigorous-but-standard advocacy, and which document types and procedural contexts are in scope. This is legal domain knowledge, not ML knowledge. It cannot be outsourced to the annotation platform.

During annotation: treat disagreements as probes rather than errors. When annotators diverge on a document, examine why. Is the document genuinely ambiguous? Is the guideline underspecified? Are annotators importing different implicit theories of the construct? The answers refine the operationalization. They also tell you where the model will be uncertain in production — which is useful information to have before deployment.
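
A lightweight way to make this systematic is to record a reason code for every disagreement that gets reviewed and tally the codes, so guideline revisions are driven by counts rather than by the most recent memorable example. The codes and file below are hypothetical.

```python
# A sketch of closing the loop during annotation: every reviewed disagreement is
# assigned a reason code, and the tally drives the revision. Codes and file are
# hypothetical.
import pandas as pd

reviews = pd.read_csv("disagreement_reviews.csv")  # columns: doc_id, reason

# Hypothetical reason codes assigned during review:
#   ambiguous_document    - the text genuinely supports both readings
#   underspecified_guide  - the guidelines do not cover this case
#   different_construct   - annotators are applying different implicit theories
print(reviews["reason"].value_counts(normalize=True))

# If "different_construct" dominates, the problem is the definition, not the
# annotators; if "underspecified_guide" dominates, the guidelines need more cases.
```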

After annotation: ask the question that most teams skip. If a model learned this label schema perfectly, would it solve our problem? The answer is sometimes no — not because the labels are wrong, but because the construct was specified for a task that differs from the downstream use in ways that only become apparent when you trace the path from label to application. Better to discover this before training than after deployment.

The confidence problem

There is one further complication worth naming. A model trained on poorly specified labels will typically be confident in its errors.

Confidence calibration reflects the training distribution, not ground truth. If annotators consistently labeled a particular surface pattern as aggressive — long, citation-dense paragraphs in infringement proceedings — the model will assign high probability to that class when it encounters that pattern, regardless of whether the underlying document is actually aggressive in any theoretically meaningful sense. The miscalibration is invisible in standard evaluation because the test set came from the same distribution as the training data.
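
A calibration check run separately on an in-distribution test split and on a shifted slice, say case types underrepresented in training, makes the problem visible before deployment. The sketch below computes a simple expected calibration error for a binary classifier; the prediction and label arrays are placeholders.

```python
# A sketch of a calibration check for a binary classifier: expected calibration
# error, computed separately on an i.i.d. test split and on a shifted slice.
# The arrays (probs_iid, y_iid, probs_shift, y_shift) are placeholders for the
# model's actual outputs and labels on those two splits.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Average gap between predicted confidence and observed accuracy."""
    confidences = np.maximum(probs, 1.0 - probs)      # confidence of predicted class
    predictions = (probs >= 0.5).astype(int)
    correct = (predictions == labels).astype(float)
    # Bin confidences (which lie in [0.5, 1.0]) into n_bins equal-width bins
    bin_ids = np.minimum(((confidences - 0.5) / 0.5 * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A model can look well calibrated on the first split and badly miscalibrated on
# the second, and standard evaluation only ever shows you the first.
# print(expected_calibration_error(probs_iid, y_iid))
# print(expected_calibration_error(probs_shift, y_shift))
```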

This matters more in high-stakes applications than in low-stakes ones — which is precisely the context where domain-specific NLP is most likely to be deployed. A legal analysis tool that is confidently wrong about what constitutes an aggressive litigation tactic is not just inaccurate. It is actively misleading in a context where the consequences of being misled are real.

The solution is not a better model. It is a better-specified label.


Label design sits at the intersection of domain knowledge and measurement theory — which is why it tends to go wrong when it is treated as a purely technical task. If you are building a custom NLP system for a specialized domain and want to get the measurement right before training begins, let's talk.
