Here is a dirty secret the AI industry does not want you to dwell on: the numbers you see in every press release, every X post, and every company blog are not as solid as they look. On June 12, 2026, Epoch AI quietly shipped FrontierMath v2, an error-corrected version of one of the most respected benchmarks in the field. The correction note revealed that 42% of the original FrontierMath problems contained small but critical errors. Let that number sink in for a moment: roughly two out of every five test questions were flawed.
Six Sigma or Six Percent?
What should a skeptical reader take away from all this? First, do not treat benchmark scores as comparable across different test suites. A 90% on Benchmark A is not remotely the same as 90% on Benchmark B. Second, look for human evaluation results and real-world deployment data, not just automated scores. Third, when a lab publishes a new benchmark result, ask whether they designed the benchmark themselves — self-evaluation is not evaluation.
The FrontierMath correction did not reveal a scandal in the sense of deliberate fraud. It revealed something subtler and more systemic: the industry has built an evaluation apparatus that is structurally noisy, and we have all been treating noise as signal. The numbers were not fake — they were just less meaningful than advertised.
The uncomfortable truth is that academic benchmarks serve two masters: science and marketing. When billions of dollars sit on a press release, the marketing function tends to win. Frontier labs know this. That is why their internal evaluation pipelines rely on human raters and real-world task completion, not multiple-choice datasets. The benchmarks you see in the news are for a different audience entirely.
How to Read Benchmark Claims Without Getting Played
The correction changed 135 problems and removed 12 entirely, shrinking the dataset from 350 to 338 questions. Models saw score bumps across the board. Rankings stayed roughly the same. So what is the problem? The problem is that before the correction, any engineering team that picked a model based on FrontierMath scores (any startup that allocated budget, any CTO who justified an architecture decision) was making that call based on a test that was quietly wrong about more than two in five of its items.
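The 42% figure can be sanity-checked against the counts quoted above. The sketch below assumes the headline percentage counts both the changed and the removed problems against the original 350; that reading is an inference from the numbers, not something the correction note is quoted as saying, and the function name is purely illustrative.

```python
# Figures as quoted in this article for the FrontierMath correction.
ORIGINAL_PROBLEMS = 350
CHANGED = 135
REMOVED = 12


def affected_share(changed: int, removed: int, total: int) -> float:
    """Fraction of the original problems that were changed or removed.

    Assumption: the headline error rate counts both groups together.
    """
    return (changed + removed) / total


share = affected_share(CHANGED, REMOVED, ORIGINAL_PROBLEMS)
remaining = ORIGINAL_PROBLEMS - REMOVED

print(f"affected share of original problems: {share:.0%}")
print(f"problems remaining after correction: {remaining}")
```

Under that assumption, 147 of 350 problems were touched, which matches the 42% headline, and the remaining count matches the 338 questions reported for the corrected set.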
- Humanity's Last Exam has wrong answers on 30% of its items. HellaSwag has a 36% error rate. If your model is state-of-the-art on a broken test, what have you actually achieved?
- When billions of dollars ride on a number, teams optimize for the number, not the capability. That is not cheating. That is rational behavior in a broken system.
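To make the error rates in the first bullet concrete, here is a rough sketch of what they imply for the highest score a model could earn. It rests on a deliberately simple assumption that is not taken from any benchmark's documentation: a model that always gives the truly correct answer is marked wrong on every item whose answer key is itself wrong. The function name `score_ceiling` is hypothetical.

```python
def score_ceiling(error_rate_pct: int) -> int:
    """Highest achievable score, in percent, for an always-correct model.

    Assumption: every item with a wrong answer key is scored as a miss,
    so the ceiling is simply 100 minus the error rate.
    """
    if not 0 <= error_rate_pct <= 100:
        raise ValueError("error_rate_pct must be between 0 and 100")
    return 100 - error_rate_pct


# Error rates as quoted in this article.
quoted_error_rates = {
    "Humanity's Last Exam": 30,
    "HellaSwag": 36,
}

ceilings = {name: score_ceiling(rate) for name, rate in quoted_error_rates.items()}

for name, ceiling in ceilings.items():
    print(f"{name}: an always-correct model tops out near {ceiling}%")
```

Under this assumption, any score above the ceiling means the model is agreeing with the answer key on items where the key is wrong, which is a very different achievement from being right.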
The models are getting smarter. The benchmarks are not keeping pace. And until the industry builds evaluation systems that measure what we actually care about, rather than what is easiest to score, every state-of-the-art claim deserves a raised eyebrow, a grain of salt, and a very careful second look.