There is a pattern that has become almost routine in enterprise AI adoption. An organization decides it needs to do something with its documents — classify them, extract information from them, summarize them. Someone suggests using GPT-4. Nobody pushes back, because GPT-4 is what everyone has heard of. The project proceeds, the costs come in higher than expected, the outputs require more human review than anticipated, and the team spends significant time writing and refining prompts that will need to be rewritten again when the model is updated.
This is not always the wrong choice. Sometimes it is exactly the right one. But the decision is rarely made deliberately, and that is a problem — because the choice of model architecture is one of the most consequential decisions in an NLP project, with significant implications for cost, accuracy, latency, and maintainability.
Two fundamentally different approaches
GPT-4 and its contemporaries are generative models. Given a prompt, they produce text. This is a genuinely remarkable capability, and for tasks that require synthesis, flexible reasoning, or open-ended generation — drafting, question answering over heterogeneous sources, summarizing documents where the relevant content varies unpredictably — generative models are often the right tool.
BERT-style models (BERT, RoBERTa, DeBERTa, and their domain-adapted variants) are discriminative models. They do not generate text. They assign probabilities to classes, extract spans, match sequences, or produce embeddings. For tasks that can be precisely specified — classify this paragraph, extract these entities, determine whether these two passages are semantically similar — discriminative models are typically more accurate, dramatically cheaper, and considerably easier to audit than generative alternatives.
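To make the contrast concrete, here is a minimal sketch of discriminative inference using the Hugging Face transformers pipeline. The checkpoint name is hypothetical and stands in for any fine-tuned sequence classifier:

```python
# A minimal sketch of discriminative inference with Hugging Face transformers.
# "my-org/legal-paragraph-classifier" is a hypothetical fine-tuned checkpoint;
# substitute any sequence-classification model you have trained.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="my-org/legal-paragraph-classifier",  # hypothetical checkpoint
)

result = classifier(
    "The applicant submits that the contested decision infringes "
    "Article 101(1) TFEU."
)
print(result)  # e.g. [{"label": "COMPETITION_ARGUMENT", "score": 0.97}]
```

No prompt, no sampling, no free text: the model returns a label and a calibrated-ish score, and nothing else.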
The distinction matters because a large share of practical NLP work falls into the second category. Classification, named entity recognition, information extraction, span detection, semantic similarity — these are well-defined tasks with well-defined outputs. Solving them with a generative model is a bit like using a search engine to answer a question that a lookup table would handle in microseconds. It works. It is not the right tool.
The cost argument
The economics are not subtle. A BERT-style classifier running on modest hardware can process tens of thousands of documents per hour at negligible marginal cost. GPT-4 API calls for the same volume run to hundreds or thousands of dollars, with latency that makes real-time applications impractical for many use cases.
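A back-of-envelope calculation makes the gap concrete. The prices and throughput figures below are illustrative assumptions, not quoted rates, but the shape of the comparison holds across a wide range of realistic values:

```python
# Back-of-envelope cost comparison. The prices below are illustrative
# assumptions, not current list prices; substitute your provider's actual rates.
docs_per_day = 50_000
tokens_per_doc = 1_500          # prompt + document + completion, assumed average

# Hosted frontier model: assumed blended price per million tokens.
api_price_per_m_tokens = 10.00  # USD, assumption
api_cost = docs_per_day * tokens_per_doc / 1_000_000 * api_price_per_m_tokens

# Self-hosted BERT-style classifier: dominated by a fixed compute cost.
gpu_hours = 2                   # assumed time to classify 50k docs on one GPU
gpu_price_per_hour = 1.50       # USD, assumption
local_cost = gpu_hours * gpu_price_per_hour

print(f"API pipeline:   ~${api_cost:,.0f}/day")    # ~$750/day at these assumptions
print(f"Local pipeline: ~${local_cost:,.2f}/day")  # ~$3/day at these assumptions
```

Quibble with any individual number and the ratio barely moves: two to three orders of magnitude is the normal spread.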
For a legal tech application processing EU court filings at scale — tagging argument types in preliminary ruling requests, extracting cited provisions from judgments, classifying procedural stages across thousands of cases — the cost difference between a fine-tuned BERT model and a GPT-4 pipeline is not marginal. It is the difference between a system that is economically viable to run continuously and one that requires careful rationing of which documents get processed.
The financial cost has a less-discussed analogue. Frontier models run on data center infrastructure with substantial energy and water requirements. Routing a classification task that a 110-million-parameter BERT model handles in milliseconds through a frontier model instead carries a real environmental cost — one that scales directly with volume. Using the most powerful model available is not a neutral default. For organizations with sustainability commitments, it is worth asking whether those commitments extend to infrastructure choices in ML pipelines.
This matters more as applications mature. Proofs of concept can absorb high per-query costs, financial and otherwise. Production systems, run continuously against growing document corpora, cannot.

The accuracy argument
Generative models hallucinate. This is not a bug that will be patched in the next release — it is a structural property of how these models work. They produce fluent, plausible text. That text is not always accurate, and the model's confidence in accurate and inaccurate outputs is often indistinguishable.
For extraction tasks, this is a serious problem. Ask GPT-4 to extract all cited treaty articles from a Commission decision and it will return a plausible-looking list. Some items on that list may be citations that do not appear in the document. Some citations that do appear may be missing. The model will not flag either type of error — it will present the list with the same fluency it brings to everything else.
The situation with frontier reasoning models is, if anything, more counterintuitive. Models like o1 and its successors are genuinely impressive on complex multi-step reasoning tasks. They are also, in practice, worse than standard language models on many structured extraction and tagging tasks. Reasoning models are optimized to think through problems step by step — a useful property when the task is open-ended, a liability when the task requires strict adherence to a constrained output format. They have a tendency to paraphrase, reinterpret, and elaborate where a tagging task requires only that they follow instructions precisely. More capability does not mean better performance on every task.
A fine-tuned span extraction model, trained on annotated examples of treaty citations in Commission decisions, does not hallucinate. It identifies spans in the source text or it does not. The outputs are auditable by construction: every extracted entity can be traced back to a specific position in the document. For any application where provenance matters — legal research, compliance, regulatory analysis — this is not a minor advantage.
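A sketch of what this looks like with a token-classification pipeline; the checkpoint name is hypothetical, but the start/end character offsets are what the standard API returns:

```python
# A sketch of auditable span extraction via token classification.
# "my-org/legal-citation-ner" is a hypothetical fine-tuned NER checkpoint.
from transformers import pipeline

extractor = pipeline(
    "token-classification",
    model="my-org/legal-citation-ner",   # hypothetical checkpoint
    aggregation_strategy="simple",       # merge sub-tokens into whole entities
)

text = "In light of Article 107(1) TFEU, the Commission finds that..."
for entity in extractor(text):
    # Every prediction carries character offsets into the source document,
    # so each extracted citation is traceable to an exact position.
    span = text[entity["start"]:entity["end"]]
    print(entity["entity_group"], repr(span), entity["start"], entity["end"])
```

An extracted citation either maps to a character range in the source or it does not exist. That property cannot be prompted into a generative model; it has to be built into the architecture.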
Another issue is that many frontier generative models are proprietary: their behavior can change at any time, sometimes unexpectedly, surprising even their owners. A proprietary LLM can perform well at a task one day and poorly the next. That is a business risk you do not need.
The hallucination problem is worse in specialized domains
General-purpose LLMs are trained on general-purpose text. Their knowledge of EU competition law, ECHR jurisprudence, or WTO dispute settlement procedure is whatever happened to appear in their training corpus — unverified, unsystematically sampled, and frozen at the training cutoff.
A domain-adapted BERT model trained on a curated corpus of EU legal documents has a different relationship to that material. It has been exposed to the specific terminology, citation conventions, and argumentative structure of that corpus in a controlled way, and its representations of those concepts reflect actual usage rather than whatever a general web crawl happened to contain. It will not confabulate a citation to a directive that does not exist, because it is not in the business of confabulating anything.
This is the core argument for domain adaptation over prompting for high-stakes legal NLP tasks. The question is not which model has read more text. It is which model's architecture is appropriate for the task and which model's training is appropriate for the domain.
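For readers who want the mechanics: domain adaptation in this sense usually means continuing masked language model pretraining on the curated corpus before fine-tuning on the downstream task. A minimal sketch, assuming a plain-text corpus file (the file name and hyperparameters are placeholders):

```python
# A minimal sketch of domain-adaptive pretraining: continue masked language
# model training on a curated legal corpus before task fine-tuning.
# "eu_legal_corpus.txt" is a hypothetical file with one document per line.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

corpus = load_dataset("text", data_files={"train": "eu_legal_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-roberta", num_train_epochs=1),
    train_dataset=tokenized["train"],
    # Randomly mask 15% of tokens; the model learns domain terminology and
    # citation conventions by predicting them from context.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```

The resulting checkpoint then replaces the generic base model in every downstream fine-tune, carrying the domain's vocabulary and usage patterns with it.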
The privacy argument
Sending documents to a frontier model API means sending them to someone else's infrastructure. For legal documents — client communications, draft contracts, confidential filings, internal legal strategy — this is not a hypothetical concern. It is a data governance question with real professional responsibility implications.
A fine-tuned BERT model runs locally. A domain-adapted classifier processing EU court judgments or internal contract reviews can run on a MacBook Pro or a private server and never expose a single document to an external API. The data stays where it belongs. There is no terms-of-service question about whether submitted text may be used in future training. For law firms, legal tech companies, and any organization handling privileged or proprietary material, this distinction is not minor.
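A minimal sketch of what "runs locally" means in practice; the model directory is hypothetical, and the offline flags shown are how the transformers library is told never to contact an external server:

```python
# A sketch of fully local inference: with HF_HUB_OFFLINE set (or
# local_files_only=True), transformers will not contact any external server.
# "models/contract-review-classifier" is a hypothetical local directory.
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # refuse all network access to the Hub

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "models/contract-review-classifier", local_files_only=True
)
model = AutoModelForSequenceClassification.from_pretrained(
    "models/contract-review-classifier", local_files_only=True
)
# Documents are tokenized and classified entirely in-process;
# no text ever leaves the machine.
```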
LLMs and BERT models are not always competitors
One use case where frontier models and BERT-style models work well together is worth naming explicitly: synthetic data generation.
Annotating training data for a specialized classification task is expensive. Domain experts are scarce and their time is valuable. One approach that has become increasingly practical is using a frontier LLM to generate synthetic labeled examples via zero-shot or few-shot prompting — producing a first-pass training corpus that a BERT model can then be fine-tuned on. The LLM handles the generation; the smaller discriminative model handles production inference, where cost, latency, and auditability matter.
This is not a universal solution — synthetic data has its own validity challenges, and the label-is-a-hypothesis problems discussed elsewhere apply here too. But it is a legitimate way to reduce annotation costs on tasks where ground-truth examples are sparse, and it illustrates that the choice is not always binary. The tools serve different purposes and can be combined deliberately.
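A minimal sketch of the generation step, to make the pattern concrete. The model name, prompt, and label scheme are illustrative assumptions, and the output would still need expert review before fine-tuning:

```python
# A sketch of LLM-assisted synthetic labeling: prompt a frontier model to
# draft labeled examples, then fine-tune a BERT-style model on the result.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write one paragraph in the style of an EU Commission competition decision "
    "that contains a state-aid argument, then on a new line output the label "
    "'STATE_AID'. Vary the facts each time."
)

with open("synthetic_train.jsonl", "w") as f:
    for _ in range(100):
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative model name
            messages=[{"role": "user", "content": PROMPT}],
        )
        text, _, label = response.choices[0].message.content.rpartition("\n")
        f.write(json.dumps({"text": text.strip(), "label": label.strip()}) + "\n")

# The resulting JSONL file becomes training data for a standard
# sequence-classification fine-tune of a BERT-style model.
```

The expensive, flexible model is used once, offline, where its errors can be caught in review; the cheap, auditable model serves every production request.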
When to use which
Generative models are the right choice when the task genuinely requires generation: drafting, synthesis across heterogeneous sources, answering questions where the relevant information is not localized in a specific span, or handling tasks whose structure varies enough that a discriminative model would require constant retraining. They are also appropriate for low-volume applications where the flexibility of a prompted model reduces development time enough to justify the cost.
BERT-style models are the right choice when the task is well-defined, the output is structured, the volume is high, the domain is specialized, and accuracy is important enough that hallucination is not acceptable. Classification, NER, information extraction, semantic search, and document similarity all qualify. For legal document analysis at any serious scale, most tasks qualify.
The AI discourse is dominated by announcements of newer, more powerful, more expensive models. That discourse is not a good guide to engineering decisions. The right question is not which model is most impressive — it is which model is most appropriate for the task, the data, the privacy requirements, and the production constraints. In a surprising number of cases, the answer is the fine-tuned BERT variant that most people stopped talking about in 2022. The AI influencers moved on. The use cases did not.
Choosing the right model architecture for a legal NLP application is a design decision, not a default. If you are evaluating approaches for a document processing pipeline and want a second opinion on the tradeoffs, let's talk.