Influencers on LinkedIn are always rediscovering that ChatGPT frequently uses em-dashes and proclaiming that this is how you can tell if text was written by an LLM. It always goes the same way. The post gets shared widely. People start scrutinizing each other's writing for em-dashes. Someone points out that good writers have always used em-dashes. (They are correct.) Someone else points out that the original post was probably LLM-generated. (They are also probably correct.) The discourse moves on, and then repeats, having produced no insights.
It is — in a word — insufferable.
The "em-dash debate" (it is not always em-dashes — but it often is) is a symptom of people trying to address a difficult problem with "folk methods" that do not work. Whether or not we can tell that text was generated by an LLM is an important methodological question — with real stakes when it comes to academic and professional integrity. The answer is, of course, considerably more complicated than monitoring punctuation use.
(And the influencers know that. But they do not care — they got your impression.)
Why surface features do not work
The intuition behind feature-based detection is straightforward: LLMs have stylistic tendencies; those tendencies produce detectable patterns; detecting the patterns identifies the text. Em-dashes, hedging phrases, certain sentence structures, an unusual density of transitional language — all of these have been proposed as signals.
(It turns out that good writing tends to exhibit certain features, and LLMs are quite good at learning them.)
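To make concrete what "feature-based detection" means in practice, here is a deliberately naive sketch: count a few of the proposed tells and combine them with weights. The feature list, the weights, and the example sentence are all invented for illustration; this is the approach the rest of this section argues against, not a working detector.

```python
import re

# Toy feature-based "detector": weighted rates of a few surface tells.
# Features and weights are invented; nothing here is a validated signal.
FEATURES = {
    "em_dash": (re.compile("\u2014"), 2.0),
    "hedging": (re.compile(r"\b(?:arguably|it is worth noting|importantly)\b", re.I), 1.0),
    "transitions": (re.compile(r"\b(?:moreover|furthermore|additionally)\b", re.I), 1.0),
}

def surface_score(text: str) -> float:
    """Weighted count of each 'tell' per 100 words."""
    words = max(len(text.split()), 1)
    return sum(
        weight * 100 * len(pattern.findall(text)) / words
        for pattern, weight in FEATURES.values()
    )

print(surface_score("Moreover, it is worth noting that the findings were robust."))
```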
The problem is that this is a classification task with a moving target.
LLM outputs are not drawn from a fixed distribution. They vary substantially by model, by prompt, by temperature setting, by the style of the input they are responding to, and by any post-generation editing done by the user. (Always edit your LLM-generated content. That way, at minimum, you know what you said.) A stylistic pattern that is a reasonable diagnostic for ChatGPT responses is not necessarily a reasonable diagnostic for ChatGPT responses to domain-specific prompts, or for Claude, or for Gemini, or for any of the many other models that will be released after the cut-off for whatever training data the detector was built on. Detectors trained on one model's outputs generalize poorly to others and degrade quickly as models are updated.
Human writing also does not have a fixed distribution. (That is generally true of anything produced by humans.) Stylistic tastes drift over time and across contexts. Em-dashes are commonly used by some writers and rarely used by others. Hedging language is characteristic of academic writing, regardless of who produced it. Certain sentence structures appear frequently in legal writing, in financial analysis, or in technical documentation — not because those texts were LLM-generated but because those registers have their own conventions. A detector trained on general text will systematically produce false positives on domain-specific professional writing, which may happen to look more like LLM output on surface features than casual prose does.
LLM detectors are unreliable in exactly the situations where reliability matters most.
What the research shows
Automated LLM detection is the subject of a lot of research, and the findings are not encouraging for anyone who wants a reliable detector. The best-performing detectors — watermarking approaches aside — achieve accuracy that looks reasonable on held-out samples from the training distribution and degrades substantially under distribution shift, paraphrasing, and cross-model evaluation.
Watermarking is a promising technical solution: some model providers can embed statistical signals in generated text that are detectable by someone who knows the key. The limitation is that watermarking requires the cooperation of the model provider, degrades when the text is substantially edited or paraphrased, and provides no signal for text generated by models that do not implement it — which includes every open-weight model and every API that does not participate. (So, nearly all of them.)
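For intuition, here is a toy sketch of detection under a "green list" watermark of the kind described in the research literature: the provider pseudo-randomly marks part of the vocabulary as green at each generation step, biases the model toward green tokens, and the detector checks whether a text contains more green tokens than chance would allow. The key, the whitespace tokenizer, and the parameters below are placeholders; real implementations operate on the model's token IDs with the provider's secret key.

```python
import hashlib
import math

SECRET_KEY = "provider-secret"   # hypothetical shared key
GREEN_FRACTION = 0.5             # fraction of the vocabulary marked "green" per step

def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token and the key."""
    digest = hashlib.sha256(f"{SECRET_KEY}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < GREEN_FRACTION

def watermark_z_score(text: str) -> float:
    """z-score of the green-token count against the null that tokens are green at rate GREEN_FRACTION."""
    tokens = text.split()  # toy tokenizer; real detectors use the model's tokenizer
    if len(tokens) < 2:
        return 0.0
    n = len(tokens) - 1
    greens = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std

print(watermark_z_score("the quick brown fox jumps over the lazy dog"))
```

Unwatermarked text should score near zero; watermarked text scores high, but only if the detector holds the key and the text has not been rewritten out from under the signal.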
Zero-shot detection methods — using a language model to evaluate the likelihood of a text under its own distribution, on the theory that LLM-generated text will have higher likelihood than human-written text — can work under controlled conditions but fail under realistic ones. Perplexity-based methods are sensitive to domain, register, and writing quality in ways that produce unacceptable false positive rates on professional writing.
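If you want to see what a perplexity-based score looks like, a minimal version can be put together with an off-the-shelf language model. This sketch assumes the transformers and torch packages are installed and uses GPT-2 as the scoring model; the hard part is not computing the number, it is choosing a threshold that does not punish fluent human prose.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; lower means more 'expected' to the scoring model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the inputs as labels makes the model return the mean
        # next-token cross-entropy; exponentiating gives perplexity.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return float(torch.exp(loss))

print(perplexity("The results of the analysis are summarized in the appendix."))
```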
The upshot is that reliable, general-purpose LLM detection — accurate across models, robust to editing, applicable without ground truth — does not currently exist. Researchers may develop new methods that make this problem more tractable. Do not count on it.
The base rate problem
There is a further issue that the em-dash discourse never addresses: base rates.
Suppose a detector achieves 90% accuracy — 90% true positive rate on LLM-generated text and 90% true negative rate on human-written text. This sounds useful. Now suppose the actual prevalence of LLM-generated text in the population you are examining is 10%. Running the detector on 1,000 texts, you have 100 that are LLM-generated and 900 that are not. The detector correctly identifies 90 of the 100 LLM texts and incorrectly flags 90 of the 900 human texts. You get 180 positives, half of which are false. The precision of the detector — the probability that a flagged text is actually LLM-generated — is 50%. (That is . . . not good.)
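The arithmetic is just Bayes' rule, and it is worth being able to redo it with your own numbers. A few lines of Python:

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability that a flagged text is actually LLM-generated (positive predictive value)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# The scenario from the text: 90% sensitivity, 90% specificity, 10% prevalence.
print(precision(0.90, 0.90, 0.10))  # 0.5
```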
A detector with 90% accuracy, applied in a context where LLM use is not the majority behavior, is a coin flip on positive predictions. For a use case like detecting plagiarism, where the consequence of a false positive is accusing someone who did not cheat, this is not a useful tool. The base rate problem is not a fixable calibration issue. It is a structural property of using classifiers in low-prevalence contexts.
(Side note: If you are an educator, never rely on tools that claim to detect LLM use. You cannot trust them, and using them can do real damage. Also, your students are using LLMs. You are going to have to deal with that — by changing your approach to assignments and exams.)
What you can actually do
None of this means the question is unanswerable in every context. It means that reliable detection requires more than pattern-matching on surface features.
Behavioral evidence — comparing a person's submitted work against earlier samples from the same person, examining consistency of voice and knowledge across a person's body of work, looking for discontinuities in a person's writing quality — is far more informative than any single-document classifier. It is also more labor-intensive and requires baseline data that is often not available. And it requires you to have good judgment. (Which you may have.)
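If you do have baseline writing from the same person, even a crude stylometric comparison can be a starting point. The sketch below compares character-trigram profiles with cosine similarity; the texts are placeholders for a real baseline corpus and a real submission, the feature choice is illustrative, and the score is a prompt for human judgment, not evidence on its own.

```python
from collections import Counter
import math

def trigram_profile(text: str) -> Counter:
    """Character-trigram counts after normalizing case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder texts; in practice these would be the author's earlier writing
# and the newly submitted document.
baseline = trigram_profile("Text drawn from the author's earlier essays goes here.")
submission = trigram_profile("The newly submitted document goes here.")
print(cosine_similarity(baseline, submission))
```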
For high-stakes applications, like assessments, a better approach is to change the incentive structure rather than improve detection. Assessments that require real-time demonstration of understanding, that are personalized enough that LLMs are not useful, or that involve process documentation alongside final output are more robust to LLM use than detection-after-the-fact. You will have to adapt your approach to assessments to the new LLM reality, and that can be a lot of work. You also need to think hard about fairness and bias. (For example, many people are bad at oral exams. That does not mean they are not competent.)
The broader point
The em-dash debate is frustrating not because it is wrong about em-dashes — although it is wrong about em-dashes — but because it reflects a broader tendency to treat a hard classification problem as though it were a pattern-recognition exercise that anyone can do by eye. It is the same error that produces confident claims about forged documents, fabricated images, and manipulated audio based on visual inspection of artifacts that turn out to have innocent explanations.
LLM detection is a serious technical problem without a clear answer: it is partially solvable in controlled conditions, but not reliably solvable in general. The "folk methods" that circulate on social media do not solve it. Ignore them. In situations where real consequences follow, turning to these methods is far worse than simply acknowledging that this is a hard problem.
Writing is hard, and LLMs make it easier. People are going to use them (even good writers) — the efficiency gains are too great to ignore. Standards of originality will have to adapt to a reality in which LLMs, which are trained on other people's work (often without permission or compensation — another problem), are doing much of the actual writing. What "originality" means is going to change. Good ideas and good taste will be the differentiators. But the aphorism that "writing is thinking" is true, and the danger of cutting too many corners is that the quality of your ideas will suffer. That is your problem, not the LLM's.
How something was written — and how much credit you get to claim for "writing" it — will matter less than what it says. That, of course, was always the true standard of good writing, anyway: Does it say something worth reading?
(I used Claude to write this post.)
The broader question of what language models actually do — and what that implies for when to trust their outputs — comes up in almost every applied NLP project. If you are thinking through these issues for a specific application, get in touch.