At some point in the life of most data-rich organizations, someone says some version of the following: "We're sitting on years of internal data. We should be doing something with AI."
They're not wrong. Years of support tickets, contracts, case files, customer interactions, or clinical notes represent a genuine asset. Modern ML methods are remarkably capable. The combination should produce something useful.
Often it doesn't, or at least not something as useful as it should be. And the reason is almost never the data or the model. It's that nobody answered a more fundamental question before the project started: what, specifically, are we trying to know, and why does it matter?
Data availability is not problem readiness
These are different things, and the distance between them is where most internal ML initiatives quietly stall.
Having data tells you what you can potentially measure. It tells you nothing about what you should measure, whether that measurement connects to a decision anyone needs to make, or whether the labels you'd construct from that data actually capture the concepts you care about. An organization with ten years of support tickets has the raw material for an ML project. It does not automatically have a well-defined ML problem.
The instinct to start with the data is understandable. The data is concrete and available. The problem definition feels like something you'll figure out as you go. But in practice, starting with the data and working backward to the problem produces a specific and predictable failure mode: a model that is technically functional and organizationally irrelevant.
Here's a version of that failure I've seen more than once. A company has accumulated several years of customer support tickets. They hire a contractor to build a classification model that automatically categorizes incoming tickets. The model achieves 89% accuracy. It gets deployed. Six months later, almost nobody on the support team is using it.
The problem wasn't the model. The problem was that the classification categories were defined by what was easy to label from historical data — broad topic areas that made sense for reporting purposes — rather than by how the support team actually decided to route and prioritize tickets. The model answered a well-specified question that turned out not to be the question anyone needed answered.
The research plan
What would have caught this? A research plan — and not an elaborate one. The discipline of answering four questions in writing before touching the data.
What decision will this system support?
Not "what can we predict" but "what will someone do differently because of this output?" This question forces the conversation from the abstract to the operational. If the answer is "it will help the support team route tickets faster," the next question is: how does the support team currently decide how to route tickets, and what information do they need to do it? The classification categories that fall out of that conversation may look very different from the categories that fall out of a historical data analysis.
This is the question most often skipped, and its absence explains most of the gap between technical success and organizational impact.
What does success look like in production?
Not benchmark performance — what changes in the business when the system is working? This question surfaces the evaluation criteria that actually matter. A model that achieves 92% accuracy on a balanced test set may be nearly useless if the cases it gets wrong are disproportionately the high-stakes ones. A model that achieves 84% accuracy but dramatically reduces the time a human expert spends on routine cases may be transformative.
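To make that concrete, one useful habit is to break evaluation down by the slices that matter operationally rather than reporting a single aggregate number. A minimal sketch, assuming the test set can be tagged with a stakes level by domain experts (the column names, labels, and data here are illustrative):

```python
# A minimal sketch: evaluate accuracy separately for high-stakes and routine
# cases instead of reporting a single aggregate number.
# Assumes a test set where each row carries the true label, the model's
# prediction, and a "stakes" tag supplied by domain experts (names illustrative).
import pandas as pd

test = pd.DataFrame({
    "true_label": ["refund", "bug", "refund", "outage", "bug", "outage"],
    "predicted":  ["refund", "bug", "billing", "outage", "bug", "bug"],
    "stakes":     ["high", "routine", "high", "high", "routine", "high"],
})

# Overall accuracy hides where the errors fall.
overall = (test["true_label"] == test["predicted"]).mean()
print(f"overall accuracy: {overall:.2f}")

# Accuracy per stakes level shows whether errors are concentrated in exactly
# the cases the organization cares most about.
by_stakes = (
    test.assign(correct=test["true_label"] == test["predicted"])
        .groupby("stakes")["correct"]
        .mean()
)
print(by_stakes)
```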
Defining production success before training begins also gives you something to evaluate against once the system is deployed — which is how you know whether it worked.
What are we actually labeling, and does that correspond to what we care about?
This is the measurement question, and it's the one with the most technical depth. Every supervised ML model learns from labels. Labels are operationalizations — decisions to treat some observable property of the data as evidence of an underlying concept. The model will learn whatever the labels capture, not whatever you intended them to capture.
If you're labeling contract clauses as "high risk" or "standard," you need a precise definition of what makes a clause high risk — one that your annotators can apply consistently and that corresponds to the concept your lawyers actually care about. If that definition is fuzzy, annotators will operationalize it differently, and the model will learn a mixture of their individual implicit criteria rather than the construct you intended.
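A cheap way to test whether a label definition is tight enough, before committing to a full annotation pass, is a small pilot: have two annotators label the same sample independently and measure their chance-corrected agreement. A sketch using Cohen's kappa from scikit-learn (the annotations below are invented for illustration):

```python
# A quick pilot check on label definition quality: two annotators label the
# same sample independently, then measure chance-corrected agreement.
# The labels below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["high_risk", "standard", "standard", "high_risk", "standard",
               "high_risk", "standard", "standard", "high_risk", "standard"]
annotator_b = ["high_risk", "standard", "high_risk", "high_risk", "standard",
               "standard", "standard", "standard", "high_risk", "high_risk"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# A low kappa (well below the ~0.6-0.7 range often treated as acceptable)
# suggests the definition of "high risk" is being interpreted differently
# and needs tightening before the full annotation pass begins.
```

Low agreement at this stage is a label-design problem, not an annotator problem, and it is far cheaper to fix before thousands of examples have been labeled.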
The right question to ask at this stage: if the model learned this label perfectly, would it solve our problem? If the answer is "not quite," the label design needs work before annotation begins.
How will we know if it's working in production?
This question is often deferred until "after deployment," which in practice means it may never get answered at all. But the monitoring and evaluation strategy for a production system follows directly from the definition of production success, and it needs to be designed before the system goes live, not retrofitted afterward when something has already gone wrong.
This includes: what metrics will you track, at what cadence, with what alerting thresholds? What does model drift look like in your context, and how will you detect it? Who is responsible for evaluating ongoing performance, and what triggers a retraining cycle?
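What drift detection looks like is domain-specific, but a common starting point is to compare the distribution of model scores (or key inputs) in production against a reference window from training time. A sketch using the population stability index; the bin count, window sizes, and alert thresholds are design choices, and the 0.10/0.25 cutoffs below are conventional rules of thumb rather than universal constants:

```python
# A sketch of one common drift check: the population stability index (PSI),
# comparing the distribution of a model score (or feature) in production
# against a reference sample from training time.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Small floor avoids division by zero for empty bins.
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
reference_scores = rng.beta(2, 5, size=5_000)   # scores at training time
current_scores = rng.beta(3, 4, size=5_000)     # scores this week in production

value = psi(reference_scores, current_scores)
print(f"PSI: {value:.3f}")
# The cutoffs below are conventional rules of thumb, not universal constants.
if value > 0.25:
    print("Significant shift: investigate and consider retraining.")
elif value > 0.10:
    print("Moderate shift: keep watching.")
```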
What this looks like in practice
A research plan doesn't have to be long. The four questions above, answered specifically and in writing, fit on one page. The discipline isn't about documentation — it's about the conversations the questions force.
In my experience, the most valuable output of a research plan process isn't the document itself. It's the disagreements it surfaces. When a data scientist, a product manager, and a domain expert sit down to answer "what decision will this system support," they frequently discover that they have been assuming different answers. The ML project that proceeds without that conversation builds in those disagreements as structural ambiguity — which surfaces later as a model that different stakeholders evaluate against different criteria and that nobody is fully satisfied with.
The research plan makes the implicit explicit before it becomes expensive to change.
The data is not the constraint
I work with organizations that have accumulated large internal corpora — legal documents, clinical records, financial filings, technical support histories. In almost every case, the data is not the binding constraint on what's possible. The binding constraint is clarity about what the organization is trying to learn from it.
Modern ML methods can extract a great deal of signal from imperfect data. They cannot compensate for a poorly framed problem or a label that doesn't correspond to the concept that matters. The investment in getting the research design right — defining the decision, specifying the measurement, designing the evaluation — pays off in systems that work operationally, not just technically.
The question "we have all this data, what should we build?" is worth sitting with longer than it usually gets. The answer shapes everything that follows.
If your organization is working through this question — you have the data, you know ML could help, but you're not sure how to define the problem or scope the project — that's exactly what an ML Discovery or ML Assessment engagement is designed to address. Let's talk.