By Josh Fjelstul, PhD · Technical · January 14, 2025 · 10 min read

Why Your NLP Model's Accuracy Number Is Probably Misleading

Accuracy, F1, and AUC are useful summaries, but they can obscure the most important questions about whether your model is actually measuring what you think it's measuring. A case for applying construct validity thinking to NLP evaluation.

A model that achieves 94% accuracy on a text classification task is not necessarily a good model. It may be a model that learned to classify correctly on your test set by exploiting features that have nothing to do with the construct you care about. It may perform well on average while failing systematically on exactly the cases that matter most to your organization. It may be measuring something adjacent to your target construct rather than the construct itself — close enough to look good in evaluation, far enough away to cause problems in production.

These are not exotic failure modes. They are common, and they tend to be invisible to standard evaluation practice.

This post is an argument for applying construct validity thinking — a framework developed in psychometrics and social science methodology — to NLP model evaluation. The framework doesn't replace standard metrics. It asks different questions alongside them, questions that standard metrics were never designed to answer.

The Problem with Accuracy as a Summary

Start with the most basic issue: accuracy assumes your test set is representative of the distribution of cases your model will encounter in production, and that errors are equally costly across classes.

Neither assumption usually holds.

If your production data has a different class distribution than your test set — because you stratified your test set to ensure balanced classes, or because the phenomenon you're classifying is genuinely rare in the wild — your accuracy figure is measuring performance on an artificial distribution. A model that achieves 94% accuracy on a balanced test set may achieve 60% precision in production on the minority class that is, in practice, the class you actually care about.
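To make the arithmetic concrete, here is a minimal sketch of how precision collapses with the base rate. The sensitivity, specificity, and prevalence figures below are illustrative assumptions, not numbers from any particular model.

```python
# How precision degrades when the production base rate is lower than a
# balanced test set implies. All numbers here are illustrative assumptions.

def precision_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) via Bayes' rule at a given base rate."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# On a balanced test set (prevalence = 0.5), 94% sensitivity and specificity
# look like roughly 94% precision.
print(precision_at_prevalence(0.94, 0.94, prevalence=0.50))  # ~0.94

# If the positive class occurs in only 5% of production traffic,
# precision on that minority class collapses.
print(precision_at_prevalence(0.94, 0.94, prevalence=0.05))  # ~0.45
```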

Class imbalance is well understood, and there are standard remedies. But the deeper problem is less commonly addressed: even if your test set is representative of the production distribution, accuracy tells you nothing about why your model is right when it's right, or what it learned to pay attention to in order to achieve that performance.

Construct Validity: A Brief Introduction

In psychometrics, a construct is a theoretical concept that cannot be directly observed — intelligence, anxiety, political ideology. A measure has construct validity to the extent that it actually measures the construct it claims to measure, rather than something correlated with it or something systematically different from it.

Establishing construct validity involves several distinct claims:

Content validity: Does the measure cover the full domain of the construct, or does it systematically undersample some aspects? A sentiment classifier trained primarily on product reviews may have poor content validity for sentiment in financial news, even if it achieves acceptable accuracy on a mixed test set.

Convergent validity: Does the measure correlate appropriately with other measures of the same construct? If your "toxicity" classifier disagrees substantially with a different well-validated toxicity classifier on cases where they should agree, that is evidence of a validity problem.

Discriminant validity: Does the measure fail to correlate with things it should be unrelated to? If your "economic anxiety" classifier is strongly predicted by post length, time of day, or author demographics in ways that shouldn't be related to anxiety, those correlations suggest the classifier learned to exploit proxy features rather than the target construct.

Predictive validity: Does the measure predict outcomes it should predict, given what the construct means? If a measure of "customer dissatisfaction" in support tickets doesn't predict churn, that's evidence of either a validity problem with the measure or a theoretical problem with the assumption that dissatisfaction predicts churn — and it's worth knowing which.

Applying This Framework to NLP

The translation from psychometrics to NLP evaluation is direct.

For content validity, ask: What kinds of texts expressing the target construct are underrepresented in my training data? If I'm building a classifier for financial risk language, did I train primarily on earnings calls? What about analyst reports, regulatory filings, internal communications? The model may have learned a narrow version of the construct that generalizes poorly to the full domain.
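One low-effort starting point is to audit where the positive examples of your construct actually come from. A minimal sketch, assuming your training data sits in a table with hypothetical text, label, and source columns:

```python
# A minimal content-validity audit: compare where the target construct appears
# in training data against the domain you intend to cover. The file name and
# columns are hypothetical stand-ins for your own corpus.
import pandas as pd

train = pd.read_csv("train.csv")  # assumed columns: text, label, source

# Which document types actually contribute positive examples of the construct?
coverage = (
    train[train["label"] == 1]
    .groupby("source")
    .size()
    .sort_values(ascending=False)
)
print(coverage)
# If earnings calls dominate and analyst reports, regulatory filings, and
# internal communications are rare or absent, the model has likely learned
# a narrow version of the construct.
```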

For convergent validity, compare your model's outputs on a held-out set against an independent measure — a different model, human annotations, or a rule-based system — specifically looking at the cases where they disagree. Systematic disagreement in one direction is a signal; random disagreement is noise. You want to understand which.
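A minimal sketch of such a comparison, assuming you have your model's labels and an independent measure for the same held-out texts. The arrays below are placeholders for illustration only.

```python
# A convergent-validity check between two measures of the same construct.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Placeholder labels; replace with your model's predictions and an independent
# measure (human annotations, another model, or a rule-based system).
model_labels     = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
reference_labels = np.array([1, 0, 0, 1, 0, 1, 0, 1, 1, 0])

# Chance-corrected agreement between the two measures.
print("kappa:", cohen_kappa_score(model_labels, reference_labels))

# Direction of disagreement: roughly symmetric off-diagonal counts look like
# noise; a lopsided off-diagonal suggests a systematic validity problem.
print(confusion_matrix(reference_labels, model_labels))
```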

For discriminant validity, run your model on texts that should score near zero on your construct and examine the distribution of outputs. Run it on texts that differ from your training distribution on surface features — different length, different author, different platform — and check whether those surface features predict model output in ways they shouldn't. Interpretability tools (SHAP, attention visualization, probing classifiers) can help identify which features the model is attending to.
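One simple probe is to ask how much of the variance in your model's scores can be explained by nuisance features alone. A sketch, assuming a hypothetical scored held-out file containing the model's score and a few surface features that should be unrelated to the construct:

```python
# A discriminant-validity probe: can surface features that should be unrelated
# to the construct predict the model's scores? Column names are hypothetical;
# substitute the nuisance features relevant to your setting.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("held_out_scored.csv")  # assumed columns: model_score, n_tokens, platform, hour_posted

X = pd.get_dummies(df[["n_tokens", "platform", "hour_posted"]], drop_first=True)
y = df["model_score"]

# If nuisance features alone explain a large share of the variance in the
# model's output, the model has likely latched onto proxies for the construct.
r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("mean R^2 from surface features alone:", r2.mean())
```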

For predictive validity, wherever possible, connect model outputs to downstream outcomes you care about. If your NLP model is supposed to identify high-risk customer communications, does it actually predict escalations, legal issues, or churn at higher rates than a baseline? If not, the problem may be in the model — or in the underlying theory about what "high-risk" means.
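A minimal sketch of that comparison, assuming each scored communication can be joined to a downstream outcome (here, whether the account later churned). The file and column names are illustrative.

```python
# A predictive-validity check: do model flags predict the outcome they should?
import pandas as pd
from scipy import stats

df = pd.read_csv("communications_with_outcomes.csv")  # assumed columns: flagged_high_risk, churned

flagged = df[df["flagged_high_risk"] == 1]["churned"]
unflagged = df[df["flagged_high_risk"] == 0]["churned"]

print("churn rate, flagged:  ", flagged.mean())
print("churn rate, unflagged:", unflagged.mean())

# A simple test of whether the difference is distinguishable from noise.
table = pd.crosstab(df["flagged_high_risk"], df["churned"])
chi2, p, _, _ = stats.chi2_contingency(table)
print("chi-squared p-value:", p)
```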

A Practical Example

Consider a named entity recognition system built to identify drug mentions in clinical notes. Standard evaluation reports entity-level F1 on a held-out test set drawn from the same hospital system as the training data.

A construct validity analysis would ask:

Content validity: Does the training data cover the ways drug mentions actually appear in notes (brand names, generics, abbreviations, misspellings) across note types, specialties, and hospital systems beyond the one the test set was drawn from?

Convergent validity: Where do the model's extractions disagree with an independent measure, such as a curated drug lexicon or a second set of annotations, and is that disagreement systematic rather than random?

Discriminant validity: Does the model's output track features it shouldn't, such as note length, section headers, or template boilerplate, rather than the drug mentions themselves?

Predictive validity: Do the extracted mentions line up with downstream signals they ought to relate to, such as the medication orders recorded for the same encounters?

Each question can reveal a different failure mode. Standard F1 on a clean test set would likely miss all of them.

What This Changes in Practice

Construct validity analysis requires more work than running sklearn.metrics. It requires thinking carefully about what your construct actually means, designing evaluation protocols that probe the model's behavior across the relevant dimensions, and being willing to let the model fail in ways your accuracy metric would never surface.

This is not comfortable work. It tends to reveal that models are not as good as their headline metrics suggest. But it is the work that separates models that perform well in evaluation from models that perform well in production — and in production, the difference matters.

The good news is that this framework also gives you a principled basis for deciding when a model is good enough. "Good enough" is not 95% accuracy on a balanced test set. It is adequate content coverage, appropriate convergent and discriminant validity, and predictive validity on the outcomes your organization actually cares about. Those are harder standards to meet, but they are the right ones.
