By Josh Fjelstul, PhD · December 10, 2024

The Question Before the Model: Research Design for ML Practitioners

The most expensive ML mistakes happen before a single line of code is written. A framework for problem framing borrowed from social science methodology — and why it produces better systems than jumping straight to model selection.

The most common failure mode in applied machine learning is not a bad model. It is a good model built to answer the wrong question.

This happens more often than the field's evaluation culture suggests, because that culture is organized around a single question: given a problem specification, how well does this model solve it? The prior question — is this the right problem specification? — falls outside the scope of most benchmarks, papers, and engineering processes. Someone has to answer it before the ML work begins, and in most organizations nobody is formally responsible for doing so.

Social science methodology has a name for this prior work: research design. And it has spent several decades developing frameworks for doing it rigorously. Those frameworks transfer directly to applied ML — not as academic overhead, but as practical tools for avoiding expensive mistakes.

What Research Design Actually Is

Research design is the set of decisions that determine what question you are actually answering, as opposed to the question you think you are answering.

It covers: How is the outcome of interest defined and operationalized? What is the unit of analysis? What is the comparison or counterfactual? What are the scope conditions — under what circumstances should the findings generalize, and where should they not? What are the assumptions required for the analysis to be valid, and how sensitive are the conclusions to violations of those assumptions?

In academic social science, these questions are answered explicitly, in writing, before data collection begins. The discipline is imperfect in practice, but the norm exists because decades of experience showed what happens when it doesn't: studies that answer precisely specified questions that nobody needed answered, findings that fail to replicate because the design assumptions didn't hold, and interventions that didn't produce their intended effects because the theoretical model connecting the measure to the outcome was wrong.

Applied ML is now accumulating the same institutional knowledge, more slowly and more expensively.

Three Questions That Should Precede Model Selection

1. What decision does this model need to support, and what does "better" mean in that context?

The answer to this question determines whether you need a classifier or a ranking system, whether precision or recall matters more, what the cost of different error types is, and what "good enough" means. It also sometimes reveals that a model is not the right tool at all — that the decision could be better supported by a simpler system, a human process, or a clearer policy.

This question is harder to answer than it sounds. "We want to identify high-value customers" is not an answer. High-value by what measure, over what time horizon, for what purpose? The answers determine what you label, how you evaluate, and what the model needs to do to be useful. Organizations that skip this question tend to end up with models that optimize a proxy metric while the actual business outcome they care about moves independently.
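
To make the point concrete, here is a minimal sketch of why the decision context, not accuracy, defines "better." The retention-offer scenario, the dollar costs, and the two models' predictions are all invented for illustration: two classifiers that an accuracy leaderboard would rank as tied can imply very different expected costs once false positives and false negatives carry different prices.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical decision context: the model flags customers for a retention
# offer. A false positive wastes the offer (~$20); a false negative loses
# the customer (~$500). Both figures are invented for illustration.
COST_FP = 20.0
COST_FN = 500.0

def expected_cost(y_true, y_pred):
    """Average per-decision cost implied by the decision context."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (fp * COST_FP + fn * COST_FN) / len(y_true)

# Two invented models with identical accuracy but different error profiles.
y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
model_a = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # one missed churner (false negative)
model_b = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])  # one wasted offer (false positive)

for name, y_pred in [("A", model_a), ("B", model_b)]:
    accuracy = (y_true == y_pred).mean()
    print(f"model {name}: accuracy = {accuracy:.2f}, "
          f"expected cost = ${expected_cost(y_true, y_pred):.2f}")
# Accuracy says the models are interchangeable; the decision context says they are not.
```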

2. What are you assuming about the relationship between what you can measure and what you care about?

ML models learn from labels. Labels are operationalizations of constructs — they translate a theoretical concept into something measurable. The quality of that translation is a design question, not a modeling question, and it determines the ceiling on what the model can possibly achieve.

If you are building a model to detect "customer dissatisfaction" and your labels are drawn from 1-star reviews, you have operationalized dissatisfaction as "dissatisfaction that reaches the threshold of a 1-star review." That is not the same construct as dissatisfaction generally — it selects on severity, willingness to take action, and probably several demographic factors. A model trained on those labels may be excellent at identifying a specific subtype of dissatisfaction while missing the broader construct entirely.

The right question is not "what labels do I have?" but "what labels would I need to validly measure the construct I care about, and how close are my available labels to that ideal?"
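
One practical way to answer that question, sketched below with invented names and numbers: hand-label a small audit sample against the construct definition you actually care about, then measure how much of the construct the available proxy label recovers. The function, the base rates, and the "1-star review" proxy are hypothetical stand-ins for whatever labels you have on hand.

```python
import numpy as np

def audit_proxy_label(construct_labels, proxy_labels):
    """Compare a proxy label (e.g. 'left a 1-star review') against a small
    hand-labeled audit sample of the construct (e.g. 'is dissatisfied')."""
    construct = np.asarray(construct_labels, dtype=bool)
    proxy = np.asarray(proxy_labels, dtype=bool)
    overlap = (construct & proxy).sum()
    return {
        "construct_coverage": overlap / max(construct.sum(), 1),  # how much of the construct the proxy reaches
        "proxy_purity": overlap / max(proxy.sum(), 1),            # how often the proxy really means the construct
    }

# Invented audit: 200 customers hand-labeled for dissatisfaction, compared
# against the available 1-star-review proxy. Base rates are made up.
rng = np.random.default_rng(0)
dissatisfied = rng.random(200) < 0.30
one_star = dissatisfied & (rng.random(200) < 0.25)  # the proxy only captures the most severe cases

print(audit_proxy_label(dissatisfied, one_star))
# Low coverage means the model can at best learn a narrow subtype of the
# construct, regardless of architecture or tuning.
```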

3. What are the scope conditions, and are they likely to hold in production?

Every model has scope conditions — the circumstances under which it should generalize. A classifier trained on English-language product reviews may not generalize to other languages, other product categories, or reviews written more than two years after the training data was collected. A fraud detection model trained on one channel may not transfer to another. A recommendation system trained on engagement data may not reflect what users actually value.

Scope conditions are usually implicit. Making them explicit — as a deliberate design step before training begins — forces several useful questions into the open: whether the training data covers the production distribution, where generalization claims are and aren't justified, and what monitoring is required to detect when scope conditions stop holding.
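
As an illustration of what monitoring a scope condition can look like in practice, here is a minimal sketch that compares training and production feature distributions with a two-sample Kolmogorov-Smirnov test. The feature names, the alert threshold, and the choice of drift test are assumptions, not a prescription; the point is that the check only exists because the scope condition was written down first.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_scope_conditions(train_features, prod_features, alpha=0.01):
    """Flag features whose production distribution has drifted away from training."""
    alerts = []
    for name, train_values in train_features.items():
        statistic, p_value = ks_2samp(train_values, prod_features[name])
        if p_value < alpha:
            alerts.append((name, statistic))
    return alerts

# Invented data: review length drifts between training and production;
# review score does not.
rng = np.random.default_rng(1)
train = {"review_length": rng.normal(120, 30, 5_000), "review_score": rng.integers(1, 6, 5_000)}
prod  = {"review_length": rng.normal(80, 30, 1_000),  "review_score": rng.integers(1, 6, 1_000)}

for name, statistic in check_scope_conditions(train, prod):
    print(f"scope condition violated for '{name}' (KS statistic {statistic:.2f})")
```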

The Practical Implication

None of this requires slowing down ML projects with academic overhead. It requires adding one structured conversation at the beginning of each engagement — before data collection, before labeling, before model selection — that answers the three questions above.

That conversation is often the most valuable hour in a project. It surfaces disagreements about what the model is for. It identifies operationalization choices that would have produced misleading labels. It establishes the scope conditions that need to be monitored in production. And it sometimes reveals that the proposed ML solution is the wrong answer to the actual question — which is a much cheaper discovery to make at hour one than at month three.

The discipline of asking these questions before writing code is not a social science quirk. It is the most effective intervention I know for improving the expected value of applied ML investments. The models that fail in production almost always have a research design failure upstream of the technical work. The models that succeed almost always reflect a clear prior answer to the question: what, exactly, are we trying to know, and why?

A Note on Organizational Implementation

The structural challenge is that research design questions are easy to skip when the incentives reward shipping models rather than framing problems. ML teams are typically evaluated on whether they built the thing they were asked to build, not on whether the thing they were asked to build was the right thing to build.

Changing this requires either embedding the research design function in the ML team — which is an argument for hiring people with social science training, not just engineering training — or creating a formal checkpoint in the project lifecycle where these questions have to be answered before development begins. The second is easier to implement. It requires a one-page document, a conversation, and the organizational will to treat a poorly framed problem as a blocker rather than an acceptable starting point.
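
For teams that want to implement that checkpoint, one lightweight option is to encode the one-page document as a structured record whose empty fields block the project. The sketch below uses hypothetical field names; the specific schema matters far less than the requirement that it be completed before development starts.

```python
from dataclasses import dataclass, field

@dataclass
class DesignBrief:
    """One-page research design checkpoint, completed before any modeling work.

    Field names are illustrative; the point is that an incomplete brief blocks the project."""
    decision_supported: str                # the decision the model informs, and for whom
    definition_of_better: str              # the metric tied to that decision, including error costs
    construct: str                         # what we actually care about measuring
    operationalization: str                # how the available labels approximate the construct
    known_label_gaps: list[str] = field(default_factory=list)
    scope_conditions: list[str] = field(default_factory=list)      # where the model should and should not apply
    production_monitors: list[str] = field(default_factory=list)   # checks that detect when scope conditions fail

    def is_complete(self) -> bool:
        """Treat a blank brief as a blocker, not an acceptable starting point."""
        return all([self.decision_supported, self.definition_of_better,
                    self.construct, self.operationalization,
                    self.scope_conditions, self.production_monitors])
```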

That shift in process produces better models. It also produces better relationships with business stakeholders, because it forces a conversation about what success actually looks like before any investment is made — rather than after, when the model is built and the goalposts have moved.
