By Josh Fjelstul, PhD · June 1, 2025 · 10 min read

How Quantitative Social Scientists Can Contribute to ML Projects

Quantitative social science has spent decades developing tools for drawing valid inferences from messy observational data about complex human phenomena. Those tools transfer directly to applied ML — and their absence explains a significant share of production failures.

Machine learning has a deployment problem. Models that perform well in development fail in production at rates the field is still coming to terms with. The standard diagnosis is technical: the training data was insufficient, the architecture was wrong, the hyperparameters were not tuned carefully enough. More data, better models, improved training procedures.

This diagnosis is sometimes correct. It is also incomplete in ways that matter. A significant share of production failures have a different cause: the problem was defined incorrectly, the measurement was invalid, or the evaluation was designed against the wrong conditions. These are not engineering failures. They are methodological failures — and they are not visible to a diagnostic framework that looks only at technical causes.

Quantitative social science has been thinking carefully about exactly these problems for decades. Not because social scientists are better engineers — they are generally not — but because the problems they study forced them to develop tools for a challenge that ML is only now confronting seriously: how to draw valid inferences from messy, observational data about complex human phenomena, using measures that are always imperfect proxies for the concepts you actually care about. The tools they developed are largely absent from ML practice. Their absence is not costless.

Four concepts that transfer directly

Construct validity

The central question in measurement theory is whether your measure actually captures the concept you intend to measure. Social scientists formalized this under the heading of construct validity, which decomposes into three testable components.

Content validity asks whether the measure covers the full domain of the construct. A measure of "legal argumentation quality" that only captures citation density has content validity problems — citation density is one indicator of quality, but quality encompasses logical structure, relevance, and persuasive coherence that citation counts do not reflect. Applied to ML: does your label schema capture the full range of the target concept, or does it operationalize a convenient proxy that correlates with the concept under normal conditions and diverges from it in the cases that matter most?

Convergent validity asks whether your measure correlates with other measures of the same construct. If two methods are both measuring "contract risk," their outputs should agree at a meaningful rate. When they don't, at least one of them is measuring something other than contract risk. Applied to ML: does your model agree with expert judgment, with alternative automated measures, and with downstream behavioral indicators where the concept should predict behavior? Systematic disagreement is a validity signal, not noise.
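To make this concrete, here is a minimal sketch of a convergent validity check in Python. The data and the column names (model_score, expert_score, alt_score) are hypothetical; in practice the scores would come from your model, an expert panel, and an alternative automated measure applied to the same documents.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical data: one row per document, three independent measures of
# the same construct ("contract risk"), all on comparable scales.
scores = pd.DataFrame({
    "model_score":  [0.9, 0.2, 0.7, 0.4, 0.8, 0.1],  # ML model output
    "expert_score": [0.8, 0.3, 0.6, 0.5, 0.9, 0.2],  # expert panel rating
    "alt_score":    [0.7, 0.1, 0.8, 0.4, 0.7, 0.3],  # alternative automated measure
})

# Convergent validity: measures of the same construct should correlate.
for a, b in [("model_score", "expert_score"),
             ("model_score", "alt_score"),
             ("expert_score", "alt_score")]:
    rho, p = spearmanr(scores[a], scores[b])
    print(f"{a} vs {b}: rho = {rho:.2f} (p = {p:.3f})")

# Low or negative correlations are a validity signal, not noise: at least
# one of the measures is capturing something other than the intended construct.
```

The rank correlation is a deliberate choice here: the three measures only need to order documents similarly, not share a scale.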

Discriminant validity asks whether your measure fails to correlate with things it shouldn't. A model trained to classify aggressive legal arguments should not be substantially predicted by document length, formatting conventions, or the identity of the authoring firm — none of which are constitutive of argumentative aggression. When a model's outputs are heavily predicted by features that have no theoretical relationship to the target concept, the measure is picking up a spurious correlate, not the construct itself.
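A discriminant validity check can be sketched the same way: regress the model's outputs on features that have no theoretical relationship to the construct and see how much variance they explain. The features and data below are simulated for illustration; the point is the structure of the test, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical nuisance features with no theoretical relationship to the
# construct: document length and an indicator for the authoring firm.
doc_length = rng.normal(5000, 1500, n)
firm_a = rng.integers(0, 2, n)

# Model scores for the same documents (simulated here; in practice, the
# model's predicted probability of "aggressive argument").
model_score = rng.uniform(0, 1, n)

# Discriminant validity: nuisance features should explain little variance
# in the model's outputs. A high R^2 means the model is tracking a spurious
# correlate rather than the construct itself.
X = np.column_stack([doc_length, firm_a])
r2 = LinearRegression().fit(X, model_score).score(X, model_score)
print(f"Variance in model scores explained by nuisance features: R^2 = {r2:.3f}")
```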

Taken together, these three tests ask: does this label mean what we say it means, does it agree with independent evidence that it should agree with, and does it fail to pick up things it shouldn't pick up? Standard ML evaluation asks none of these questions. It asks whether the model predicts the labels. Whether the labels are valid is assumed.

Measurement invariance

A measure has measurement invariance if it captures the same construct in the same way across different groups, contexts, and conditions. This is not the same as asking whether the measure has the same average value across groups — it is asking whether the underlying measurement process is equivalent.

A sentiment classifier trained on English-language product reviews may not measure sentiment equivalently when applied to French-language reviews, professional communications, or user-generated content from different demographic groups. The model has not changed. The data-generating process has. If the relationships between surface features and the latent construct differ across groups or contexts, the model is measuring different things in different places, and comparisons across those groups or contexts are not valid.

Social scientists test for this using measurement invariance analysis — examining whether the factor loadings, intercepts, and residual variances of a measurement model hold across groups. The ML equivalent involves examining whether model performance, error rates, and prediction distributions are consistent across the subgroups and contexts the model will encounter in production. This is related to, but more fundamental than, the fairness auditing that ML practitioners do conduct — fairness auditing typically examines outcome disparities, while measurement invariance asks whether the underlying construct is being measured equivalently in the first place.
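As a rough sketch of what that stratified examination might look like, assuming a hypothetical evaluation set with a group column identifying the subgroup or context of each example:

```python
import pandas as pd

# Hypothetical evaluation set: true label, model prediction, model score,
# and the subgroup or context each example came from.
eval_df = pd.DataFrame({
    "group":       ["en", "en", "en", "fr", "fr", "fr", "ugc", "ugc", "ugc"],
    "y_true":      [1, 0, 1, 1, 0, 0, 1, 0, 1],
    "y_pred":      [1, 0, 1, 0, 0, 1, 1, 1, 0],
    "model_score": [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.5, 0.4],
})

# Stratified checks: accuracy, false positive rate, and the score
# distribution within each group. Divergent error patterns across groups
# suggest the model is not measuring the construct equivalently everywhere.
for name, g in eval_df.groupby("group"):
    negatives = g[g["y_true"] == 0]
    accuracy = (g["y_true"] == g["y_pred"]).mean()
    fpr = (negatives["y_pred"] == 1).mean() if len(negatives) else float("nan")
    print(f"{name}: n={len(g)}, accuracy={accuracy:.2f}, "
          f"fpr={fpr:.2f}, mean_score={g['model_score'].mean():.2f}")
```

In a real evaluation the groups would be the populations and contexts the model will actually encounter in production, and the per-group samples would need to be large enough for the comparisons to be meaningful.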

The practical consequence of ignoring measurement invariance is a system that appears to work uniformly but produces outputs that mean different things depending on who or what is being measured. For applications in hiring, lending, legal analysis, or clinical decision support — anywhere the outputs affect people in consequential ways — this is not a theoretical concern.

Identification and confounding

Social scientists studying causal questions face a structural problem: the data they have access to is observational. The treatment was not randomly assigned. The outcome is correlated with many things, and disentangling the causal effect of any one variable from the correlational noise requires making assumptions — about what was controlled for, what was not, and what the data-generating process looked like.

This is precisely the problem that supervised ML faces, and ML practice is considerably less careful about it than social science practice. A model trained on observational data learns the correlational structure of that data. If the target label is confounded with features that have no causal or definitional relationship to the concept — if "high-risk contract clause" is correlated with document length because long contracts happen to contain more risk in the training corpus — the model will learn the confound. It will work well on data from the same distribution. It will fail, systematically and in predictable ways, on data where the confound does not hold.

Social scientists have developed a rich toolkit for thinking about identification: instrumental variables, regression discontinuity, difference-in-differences, propensity score methods. Most of these do not translate directly to supervised ML. What does translate is the habit of asking: what assumptions are required for this analysis to be valid, and what happens to the conclusions if those assumptions are violated? A model trained to predict contract risk from a corpus where risk correlates with client industry is making an implicit assumption that the correlation is substantive. Testing that assumption — by examining whether the model still performs when the industry confound is removed — is the ML equivalent of a sensitivity analysis. It is not standard practice. It should be.
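Here is one way such a sensitivity analysis might be sketched, using simulated data and a generic classifier: permute the suspected confound at evaluation time and see how much performance depends on it. The variable names (industry, risk) are illustrative, not a prescription.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Hypothetical training corpus: "industry" is a suspected confound that is
# correlated with the risk label but has no definitional relationship to it.
industry = rng.integers(0, 5, n)
substantive = rng.normal(0, 1, (n, 3))  # features that should carry the signal
risk = (substantive[:, 0] + 0.8 * (industry == 0) + rng.normal(0, 1, n)) > 0.5

X = np.column_stack([substantive, industry])
X_tr, X_te, y_tr, y_te = train_test_split(X, risk, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
auc_full = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Sensitivity check: permute the confound at test time so it carries no
# information. A large drop in performance means the model is leaning on
# the confound rather than on substantive features.
X_te_perm = X_te.copy()
X_te_perm[:, -1] = rng.permutation(X_te_perm[:, -1])
auc_perm = roc_auc_score(y_te, model.predict_proba(X_te_perm)[:, 1])

print(f"AUC with confound intact:   {auc_full:.3f}")
print(f"AUC with confound permuted: {auc_perm:.3f}")
```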

Scope conditions

Every empirical claim has a domain of applicability. In quantitative social science, stating scope conditions explicitly is a methodological norm (not one that every study follows, but one that good research does): findings hold under specified conditions — for certain populations, time periods, institutional contexts — and claims that exceed those conditions are considered overreach.

ML models have scope conditions too. They are rarely stated. A document classifier trained on ECJ preliminary rulings from 2010–2020 has implicit scope conditions: it was trained on documents from a specific court, in a specific procedural context, during a specific period when EU legal doctrine was in a particular state. Deploying it on General Court judgments from 2024, or on national court decisions applying EU law, exceeds those scope conditions in ways that may or may not matter depending on how much the relevant features have changed.

The habit of stating scope conditions is not just about intellectual honesty. It is about knowing in advance where a model is likely to fail. A model with stated scope conditions can be monitored against those conditions — its performance can be tracked as production data drifts toward the boundaries of its validity domain. A model with unstated scope conditions fails unexpectedly, because nobody defined in advance what "unexpected" means.
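A minimal monitoring sketch, assuming you track a reference feature from the training data and compare it against incoming production data, might look like the following; the feature, the test statistic, and the threshold are all choices you would make against the model's stated scope conditions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical reference (training) and production values for a monitored
# feature, e.g. document length or a key embedding dimension.
train_values = rng.normal(0.0, 1.0, 5000)
prod_values = rng.normal(0.4, 1.2, 800)  # production has drifted

# Two-sample Kolmogorov-Smirnov test: how different is the production
# distribution from the one the model was trained and validated on?
res = ks_2samp(train_values, prod_values)

DRIFT_THRESHOLD = 0.1  # tune against the stated scope conditions
if res.statistic > DRIFT_THRESHOLD:
    print(f"Drift alert: KS statistic = {res.statistic:.3f} (p = {res.pvalue:.1e}); "
          "production data is approaching the boundary of the validity domain.")
else:
    print(f"Within scope: KS statistic = {res.statistic:.3f}")
```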

Why engineering training doesn't emphasize this

This is not a criticism. It is an explanation.

Computer science and software engineering training optimizes for building systems that work. The central questions are: does the code execute correctly, does the system scale, does the architecture support the required functionality? These are the right questions for a large class of problems. They are necessary but not sufficient for problems that involve making inferences about complex human phenomena using imperfect measures.

The questions that measurement theory, causal inference, and scope condition analysis address — does the measure capture the right concept, for the right population, under the right conditions — belong to a different disciplinary tradition. That tradition developed because the phenomena it studies do not have ground truth in the engineering sense. There is no correct label for "legal risk" or "political instability" or "patient deterioration" waiting to be verified. The construct has to be defined, the measurement has to be validated, and the scope has to be specified. Engineering training does not prepare practitioners for this, not because it is inadequate, but because it was designed for a different class of problems.

ML is increasingly being asked to address problems that fall outside that class. The methodological gap is a consequence of the field's success, not its failure.

What this looks like in practice

Construct validity in practice means writing a labeling guide that defines the target concept with enough precision that annotators can apply it consistently, examining annotator disagreements as validity probes rather than noise, and asking whether a model that learned the label perfectly would actually solve the problem. It means running convergent validity checks — comparing model outputs against expert judgment and alternative automated measures — before declaring a system ready for production.
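One concrete version of treating annotator disagreement as a validity probe is to compute chance-corrected agreement and route the disagreements back into the labeling guide. A minimal sketch, with hypothetical annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators applying the same labeling guide
# to the same ten documents.
annotator_a = ["risk", "no_risk", "risk", "risk", "no_risk",
               "risk", "no_risk", "no_risk", "risk", "risk"]
annotator_b = ["risk", "no_risk", "no_risk", "risk", "no_risk",
               "risk", "risk", "no_risk", "risk", "no_risk"]

# Chance-corrected agreement. Low kappa is a construct validity signal:
# the labeling guide does not define the concept precisely enough for
# independent annotators to apply it consistently.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Treat disagreements as probes, not noise: review them to find the parts
# of the construct definition that are ambiguous.
disagreements = [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b)) if a != b]
print(f"Documents to review: {disagreements}")
```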

Measurement invariance in practice means stratifying evaluation across the subgroups and contexts the model will encounter in production, not just holding out a random sample of the training distribution. It means asking, for each subgroup, whether the model's error patterns are consistent with measuring the same construct equivalently, or whether they suggest the model has learned different things in different contexts.

Identification in practice means asking what features the model is actually using to make predictions — not just which features are nominally available, but which features are doing the predictive work — and whether those features have a defensible relationship to the target concept. It means running ablations not just to improve performance but to test whether the model would still work if the likely confounds were removed.

Scope conditions in practice means writing down, before deployment, the conditions under which the model is expected to perform reliably and the conditions under which it is not. It means building monitoring around those conditions so that performance degradation is detected when the production distribution drifts toward the boundaries of the validity domain.

None of this is hard. It is disciplined thinking about measurement and inference, applied to systems that are increasingly making consequential decisions. The tools exist in a literature that ML practitioners rarely read. The cross-pollination is overdue — and the production failure rate suggests that the cost of deferring it is not trivial.


The gap between technical ML capability and rigorous application is where most production failures originate — and where research training in quantitative methods makes the most practical difference. If you want to discuss how these ideas apply to a specific project, get in touch or read more about how I work.
