AI detection tools are everywhere now — Turnitin, GPTZero, Originality.ai, and dozens more. But how do they actually work? What do they measure? And how accurate are they really? This article cuts through the hype with a clear, honest look at the technology.
Most AI detection tools rely on two core statistical signals derived from language model research.
Perplexity measures how "surprising" a piece of text is to a language model. When an AI writes text, it naturally chooses high-probability, low-surprise word sequences — the words that statistically fit best in context. This produces low-perplexity text. Human writers, by contrast, make unexpected word choices, use unusual structures, and take creative risks — producing higher-perplexity text overall.
Detection tools measure this: low perplexity suggests AI authorship; high perplexity suggests human authorship.
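To make the idea concrete, here is a minimal sketch of perplexity scoring in Python. It uses the open-source Hugging Face `transformers` library and the small GPT-2 model purely for illustration; commercial detectors rely on their own (often proprietary) models and calibration, and this is not how any particular tool is implemented.

```python
# Minimal perplexity sketch: how "surprising" a text is to a language model.
# Illustrative only -- uses GPT-2 via Hugging Face transformers, not any real detector.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the model's perplexity for `text` (lower = more predictable)."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the model returns the average
        # cross-entropy loss over the predicted tokens.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

ai_like = "The sun rises in the east and sets in the west every single day."
human_like = "Dawn cracked sideways over the rooftops, rude and orange as a warning."

print(perplexity(ai_like))     # typically lower: predictable phrasing
print(perplexity(human_like))  # typically higher: unexpected word choices
```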
Burstiness refers to variation in sentence length and complexity. Human writing is "bursty" — we write long complex sentences, then short punchy ones, then medium ones. AI writing tends to be uniform: each sentence is roughly the same length and complexity, creating low burstiness. Detectors measure this variation (or lack of it) as an additional signal.
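There is no single agreed-upon formula for burstiness. The sketch below scores it as the coefficient of variation of sentence lengths (standard deviation divided by the mean), which is one plausible way to capture the idea; real detectors may compute it differently and also factor in sentence complexity.

```python
# Minimal burstiness sketch: variation in sentence length across a text.
# One plausible formulation (coefficient of variation), not any tool's actual method.
import re
import statistics

def burstiness(text: str) -> float:
    """Higher values = more variation in sentence length (more 'human-like')."""
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.stdev(lengths) / mean if mean else 0.0

uniform = ("The report covers the results. The results were positive. "
           "The team was pleased with them. The next phase starts soon.")
varied = ("It worked. After three failed prototypes, two budget reviews, and one "
          "very long week, the sensor finally returned clean data. Everyone cheered.")

print(burstiness(uniform))  # low: every sentence is about the same length
print(burstiness(varied))   # high: short and long sentences mixed together
```

In practice, detectors combine signals like these (plus others) and calibrate thresholds on labeled examples rather than relying on either number alone.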
| Tool | Primary audience | Method | Reported accuracy |
|---|---|---|---|
| GPTZero | Educators | Perplexity + burstiness scoring | ~85% on clean AI text |
| Turnitin AI | Universities | Proprietary ML + similarity | ~98% claimed, debated |
| Originality.ai | Publishers, SEO | Fine-tuned detection model | ~94% on GPT-4 content |
| Copyleaks | Enterprises | Multi-model detection | ~99% claimed |
| ZeroGPT | General public | Text analysis heuristics | Inconsistent in testing |
Note: Accuracy figures are self-reported or from limited testing. Real-world performance on edited or humanized text is significantly lower for all tools.
This is the most important and least-discussed issue with AI detection: false positives are common and consequential.
Research has consistently shown that certain types of human writing score very high on AI detection tools — not because they were AI-generated, but because they happen to share stylistic traits with AI output: formulaic structure, conventional phrasing, and a limited vocabulary range.
A 2023 study found that GPTZero flagged over 50% of essays written by non-native English speakers as AI-generated. This is a serious problem when detection results are used to make academic misconduct decisions.
"AI detectors don't detect AI — they detect writing that resembles AI. That's a crucial distinction, and most institutions aren't making it clearly enough."
Several factors consistently make AI-generated text harder for detection tools to catch: human editing and paraphrasing, mixing AI-drafted passages with original writing, and prompting the model toward more varied, specific phrasing.
AI detection is fundamentally an arms race that detection tools are losing. As language models improve, their output naturally becomes harder to distinguish from human writing. Each new model generation produces more varied, nuanced text — and the detection tools trained on older model outputs are less accurate on new ones.
OpenAI released and then quietly retired its own AI classifier in 2023, noting it was "not reliable enough." Anthropic has not released a detection tool, citing similar concerns about accuracy.
If you're a writer using AI tools for any purpose — whether that's drafting, editing, research, or brainstorming — the most important things to know are that detection scores are probabilistic signals rather than proof, that false positives are real and fall hardest on certain writers, and that accuracy drops sharply once text has been substantially edited or revised.
Humanizor rewrites AI text to have natural variation, specific vocabulary, and human rhythm. Free, no sign-up required.
✦ Try Humanizor free