The science of measuring AI
The Stopwatch and the Exam
Static benchmarks did not die. They stopped being enough. A grounded look at the move from capability to agency, the harness problem underneath it, and where attention belongs now.
A model can score 95% on a coding benchmark and still fail the kind of messy, multi-hour task a developer would actually be paid to finish. That gap, between the number on the leaderboard and the work you would trust the system to do, has quietly become the center of AI evaluation.
For most of the last decade the two were close enough that we could ignore the distance between them. You wrote a fixed set of questions with known answers, ran the model once, and reported a percentage. MMLU, GLUE, GSM8K, HumanEval. Progress was a stack of test sets, and a higher score meant a better model. That arrangement worked while the tasks were small and the answers were unambiguous.
It would be easy, and wrong, to say that era ended. Static benchmarks did not become useless, and anyone who tells you they did is selling something. What changed is narrower and more interesting. The static exam stopped being enough. The frontier has climbed to the ceiling on most of the benchmarks that defined the last few years. And the questions that now decide whether you can deploy an agent (how long it stays coherent, how often it repeats a success, how much it costs, and how it behaves when it fails) are not questions an exam was ever built to ask.
What follows is an attempt to map that shift without overselling it. It rests on the published science of AI evaluation and on a catalog of agentic benchmarks I maintain in the open. Before the first number, the ground rules.
The catalog's open data is published at /data.json and /data.csv. I sort every benchmark on four independent axes: how much headroom remains, how well it resists contamination, how comparable its harness is across systems, and how close its task sits to real economic work. Read the percentages below as directional rather than canonical, and follow the link before you cite any of them. Several are self-reported or provisional, and the catalog says so on the row.
One more piece of hygiene before the argument, because the rest of the piece depends on it. Saturation, contamination, and agency are three different problems. A benchmark can be saturated without being contaminated. It can be contamination-proof and still measure the wrong thing. It can be genuinely agentic and badly designed. Most loose commentary blurs the three into a single vibe of "benchmarks are broken." They are not broken in one way. They fail in distinct ways that call for distinct fixes, and keeping them apart is the whole point.
Part I: The old score stopped being enough
A benchmark is useful while the frontier still sits inside it. Once the top few systems bunch within a point or two of the ceiling, it has stopped resolving differences between them and become a smoke test. By that standard a lot of the agentic landscape is already spent. WebVoyager sits near 98.5%. BrowseComp went from roughly 51% to 90% in about a year. Cybench, the standard for autonomous capture-the-flag, is effectively solved, and the lab that long led it has publicly called the benchmark no longer useful and moved to harder cyber evaluations. SWE-bench Verified, the most-cited coding-agent benchmark in the world, now carries vendor numbers around 95%.
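The "bunched within a point or two of the ceiling" test is simple enough to write down. What follows is a minimal sketch, not the catalog's actual method: the function name, the `top_n` and `margin` thresholds, and both example leaderboards are illustrative assumptions.

```python
def is_saturated(scores, ceiling=100.0, top_n=3, margin=2.0):
    """Return True when the top `top_n` scores all sit within `margin`
    points of `ceiling`. The thresholds are illustrative assumptions."""
    if len(scores) < top_n:
        return False  # too few systems to call the frontier "bunched"
    top = sorted(scores, reverse=True)[:top_n]
    return all(ceiling - s <= margin for s in top)

# Hypothetical leaderboards, not real results.
bunched = [98.5, 98.1, 98.0, 91.0]
spread = [95.0, 81.0, 74.0, 60.0]

print(is_saturated(bunched))  # True
print(is_saturated(spread))   # False
```

Note that the second board has a 95% leader and still counts as unsaturated under this rule, because the systems behind it are far enough apart for the benchmark to keep separating them.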
The strongest case for the old leaderboard
It is worth steelmanning the thing I am about to qualify, because the case for the classic static benchmark is real and most critiques skip it. Static benchmarks are cheap to run, which means everyone can afford to. They are comparable, the same fixed questions scored the same way across dozens of models. They are longitudinal, so you can watch a capability move over two years on one axis. And they are excellent regression tests, the cheapest way to catch a new model quietly getting worse at arithmetic. None of that goes away. The honest position is not that these benchmarks are worthless. It is that they are necessary and no longer sufficient. Keep them for what they are good at, and stop reading a saturated score as evidence that the underlying job is done. A 95% on a contaminable exam is a fine regression check and a terrible deployment decision.
Contamination is a different failure than saturation
Saturation tells you a benchmark stopped resolving the frontier. Contamination tells you the score was partly fiction to begin with. The mechanism is Goodhart's law: once a benchmark becomes the target everyone optimizes toward, it drifts from measuring the skill to measuring fit to that particular test. Surveys of benchmark contamination (survey, 2024) find test material leaking into training data at rates reported as high as 45%, and paraphrased leakage slips past simple string matching. The cleanest demonstration in the agent world is the Konwinski Prize, a SWE-bench-style coding benchmark built only from GitHub issues filed after the models' training cut-off, which makes contamination impossible by construction. The winning entry scored 7.5%, against the roughly 75% comparable systems report on the older, contaminable SWE-bench Verified.
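The post-cutoff construction is easy to picture as a date filter. This is a sketch under stated assumptions, not the Konwinski Prize's actual pipeline: the record layout, the `post_cutoff` helper, the issue ids, and the dates are all hypothetical.

```python
from datetime import date

def post_cutoff(issues, training_cutoff):
    """Keep only issues filed strictly after the model's training cutoff,
    so the model cannot have seen them (or their fixes) during training."""
    return [issue for issue in issues if issue["filed"] > training_cutoff]

# Hypothetical issue records; ids and dates are made up for illustration.
issues = [
    {"id": "issue-a", "filed": date(2024, 11, 2)},
    {"id": "issue-b", "filed": date(2025, 3, 14)},
    {"id": "issue-c", "filed": date(2025, 6, 1)},
]
cutoff = date(2025, 1, 31)  # assumed training cutoff

clean = post_cutoff(issues, cutoff)
print([issue["id"] for issue in clean])  # ['issue-b', 'issue-c']
```

The design choice worth noticing is that nothing here inspects the text of the issue. A date filter sidesteps the paraphrase problem that defeats string matching, at the price of needing a trustworthy cutoff date for every model evaluated.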
These two problems often travel together, but treating them as one is how you end up trusting a number twice. Saturation says "this no longer separates the best systems." Contamination says "this score was inflated from the start." Different evidence, different remedy. The remedy for saturation is a harder benchmark. The remedy for contamination is a benchmark the model has never seen.
Part II: Why agent benchmarks are not just harder exams
The reflex when a benchmark saturates is to write a harder one with the same shape. More obscure questions, trickier math, a longer reading passage. That instinct is what produced frontier exams like GPQA and Humanity's Last Exam, and they are valuable. But the agent shift is not that. It is a change in the kind of thing being measured, not just the difficulty.
An exam scores an answer. An agent benchmark scores a trajectory: many steps, in a stateful environment, using tools, recovering from its own mistakes, judged on whether a job got done. SWE-bench asks an agent to resolve a real GitHub issue against a hidden test suite. OSWorld and WebArena drop it inside an actual operating system or website. τ-bench makes it hold a multi-turn conversation with a simulated user while calling tools and following policy. Success is no longer correct-or-incorrect. It is succeeded, or failed at step seven, or finished half the job.
This matters because of construct validity, the property an instrument has when it measures the abstract thing it claims to. The foundational critique here (Raji et al., 2021) argued that the field routinely treats narrow test sets as if they measured general capability. Agent tasks split the question in two. Internal validity asks whether the benchmark scores performance correctly inside its own world: are the tests right, is the reward honest. External validity asks whether doing well in that world predicts doing well in yours. An exam could mostly dodge both. An agent benchmark cannot, and the place where internal validity quietly breaks is the harness.
Part III: The harness problem
This is the part that should change how you read every number that follows, which is why it comes before the map and not after it. In the exam era you benchmarked a model. In the agent era you benchmark a model plus a harness: the scaffolding that gives the model its tools, memory, retry logic, and prompt structure. And the harness often matters more than the model.
The evidence is not subtle. On WebArena, the same underlying model scores roughly 48% raw and around 72% scaffolded, a 24-point swing from scaffolding alone. On SWE-bench Pro, the same benchmark yields about 80% under the vendor's own scaffold and only about 59% under Scale's neutral, standardized harness, a gap of roughly 20 points that comes entirely from who built the agent around the model.
It gets worse when you check the reward functions. The most important methodological paper of this cycle (Princeton ABC, NeurIPS 2025) audited widely used agent benchmarks against an Agentic Benchmark Checklist. SWE-bench Verified uses too few hidden tests, so some issues marked "resolved" are not actually solved. τ-bench, under certain conditions, counted empty responses as successes. Across the ten benchmarks they assessed, seven had task-validity problems, seven had outcome-validity problems, and all ten had reporting limitations. The flaws can mis-estimate performance by up to 100% in relative terms. You can watch it happen: strengthen a benchmark's hidden tests and a leading agent drops from roughly 79% to 62%. The instrument had been overcounting by design.
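"Relative terms" is worth unpacking with the numbers above. The sketch below assumes the relative error is measured against the corrected score, which is my reading rather than something the paper's wording settles; the helper name is hypothetical.

```python
def relative_overcount(reported, corrected):
    """Overstatement as a fraction of the corrected score.
    Assumption: the corrected score is the baseline for 'relative'."""
    return (reported - corrected) / corrected

# The hidden-test example from the text: roughly 79% reported, 62% corrected.
print(79 - 62)                               # 17 (percentage points)
print(round(relative_overcount(79, 62), 3))  # 0.274

# A 100% relative error means the reported score is double the true one.
print(relative_overcount(40, 20))            # 1.0
```

On that reading, a 17-point drop is an overcount of about 27% of the agent's real performance, and the worst cases in the audit are ones where the reported number is twice what the agent earned.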
Put the two together and you get the practical rule for this whole field. Agentic results are usually reported without confidence intervals, multiple seeds, controlled cost, or a fixed harness. So a vendor-reported state of the art is best read as an upper bound produced by an optimized harness, not a reproducible measurement. The 20-point gap that appears whenever a neutral party re-runs the benchmark is not noise. It is the harness, made visible.
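To see why a missing confidence interval matters, here is a minimal sketch using the Wilson score interval for a success rate. The choice of Wilson is mine, not something the cited work prescribes, and the run counts are hypothetical.

```python
import math

def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval for a binomial success rate.
    z=1.96 corresponds to roughly 95% coverage."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1.0 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half, center + half

# Hypothetical: the same 80% success rate from 50 runs and from 5 runs.
low, high = wilson_interval(40, 50)
print(f"{low:.2f} to {high:.2f}")  # roughly 0.67 to 0.89

low5, high5 = wilson_interval(4, 5)
print(f"{low5:.2f} to {high5:.2f}")
```

With only five runs the interval is wide enough to swallow a 20-point gap whole, which is why a single unreplicated score cannot tell a harness effect from luck.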
Part IV: The new axes: time, cost, repetition, value
If the old single number is no longer sufficient, what replaces it is not one better number but several axes that a saturated accuracy score was hiding. Four are worth holding onto.
Time: the stopwatch
The most important conceptual move of the agent era is METR's time-horizon metric (METR, 2025). Instead of asking what fraction of tasks an agent solves, it asks how long a task can be, measured by how long it takes a human professional, before the agent's reliability falls to 50%. This reframes capability as duration, and the resulting trend line is the most-cited in the field: that horizon has been doubling roughly every seven months for six years. At the frontier it is now measured in hours.
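The idea can be sketched in a few lines: fit a success curve against the logarithm of human task length, then read off where it crosses 50%. This is a simplified illustration, assuming a logistic curve fitted by plain gradient descent; it is not METR's code, and the task data and function names are hypothetical.

```python
import math

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * x + b) by plain gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def time_horizon_minutes(tasks, reliability=0.5):
    """tasks: list of (human_minutes, succeeded). Returns the task length
    at which the fitted success probability equals `reliability`."""
    logs = [math.log2(minutes) for minutes, _ in tasks]
    mean = sum(logs) / len(logs)
    xs = [v - mean for v in logs]  # center the inputs for stable descent
    ys = [1.0 if ok else 0.0 for _, ok in tasks]
    a, b = fit_logistic(xs, ys)
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** (mean + (logit - b) / a)

# Hypothetical agent results: (human minutes to complete, agent succeeded).
tasks = [(1, True), (2, True), (4, True), (8, False),
         (16, True), (32, False), (64, False), (128, False)]

print(round(time_horizon_minutes(tasks), 1))  # 11.3
```

Asking for a stricter bar, for example `reliability=0.8`, returns a shorter horizon from the same data, which is the whole point of stating the reliability level next to the duration. As a separate piece of arithmetic, a seven-month doubling sustained for six years compounds to 2^(72/7), roughly a 1,250-fold increase.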
What can still go wrong with the stopwatch
The time horizon is the best single idea in agent evaluation, which is exactly why it deserves scrutiny rather than applause. Four limits matter. First, the x-axis is a human estimate, and "how long would this take a professional" is itself noisy and contested. Second, the long end has enormous uncertainty: the catalog's own METR row carries a confidence interval running from about 6 to 98 hours, so any single headline figure hides a wide band. Third, 50% reliability can mask catastrophic failure. An agent that finishes a day-long task half the time is not half-useful if the other half deletes a production database, because some failures are not recoverable and a coin-flip on those is unacceptable. Fourth, duration is a proxy for value, not value itself. Some long tasks are long because they are tedious, not because they are worth much. The stopwatch is the right instrument and it is not a verdict.
Cost and repetition
A single accuracy number hides two variables that decide deployability: what a run costs, and whether it works the second time. On cost, the foundational result (Kapoor & Narayanan, 2024) is that once you control for spend, simple baseline agents often match elaborate ones at a fraction of the cost. Princeton's Holistic Agent Leaderboard puts this into practice with an accuracy-versus-cost frontier rather than a one-dimensional ranking. The spread is real: comparable agentic runs in the catalog range from a few cents to around 87 dollars per task. On repetition, the metric is pass^k, the probability of succeeding on all k attempts. On τ-bench, pass^1 sits well above pass^4, which means agents that look competent once are inconsistent under repetition. For anything you would deploy, pass^k is closer to what you care about than the headline pass@1.
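The difference between "works at least once" and "works every time" shows up clearly in code. This sketch uses the standard combinatorial estimators from n recorded trials with c successes; the function names and the trial counts are illustrative assumptions, not figures from any leaderboard.

```python
from math import comb

def pass_hat_k(n, c, k):
    """Estimate pass^k: the chance that k trials drawn from the n recorded
    ones ALL succeed, given c successes among the n."""
    return comb(c, k) / comb(n, k)

def pass_at_k(n, c, k):
    """Estimate pass@k: the chance that AT LEAST ONE of k drawn trials
    succeeds, given c successes among the n."""
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 8 recorded trials, 6 of them successful.
n, c = 8, 6
print(pass_hat_k(n, c, 1))            # 0.75
print(round(pass_hat_k(n, c, 4), 3))  # 0.214
print(pass_at_k(n, c, 4))             # 1.0
```

The same agent on the same task reads as 100% under pass@4 and about 21% under pass^4. In a real evaluation you would average each estimate across tasks; the per-task gap is what the headline number hides.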
Economic value
The most honest benchmark is the job a professional gets paid to do. GDPval (OpenAI, 2025) covers 1,320 tasks across 44 occupations in the nine largest sectors of US GDP, graded blind by experts comparing the AI deliverable against a human one. Frontier systems win or tie on roughly 41% of tasks, where 50% would be parity. The long-horizon simulations are starker. On Vending-Bench, a simulated year-long vending-machine business, the best agent nets about 17% of what a competent human operator earns. On TheAgentCompany, agents fully complete only around 30% of multi-step office workflows. And on full enterprise workflows in WorkArena++, success runs near 0 to 2%.
Part V: The 2026 map
Sort the catalog by headroom and saturation band and it falls into three zones. The strategic fact is that value has moved toward the open frontier, the benchmarks still reporting low numbers.
| Zone | Representative benchmarks | What the frontier looks like | How to use it |
|---|---|---|---|
| Saturated: smoke tests | WebVoyager (~98%), Online-Mind2Web (~97%), Cybench (~100%), BrowseComp (~90%) | Frontier bunched at the ceiling | Cheap regression checks only; no longer resolves the frontier |
| Contested: read the harness | SWE-bench Verified (~88–95%), Terminal-Bench, OSWorld (~80%, past human), tau2 (telecom ~99%, airline ~76%) | Large vendor-versus-neutral gaps; variance across splits | Trust only with a disclosed harness and a per-split breakdown |
| Open frontiers: where the signal lives | K-Prize (7.5%), WorkArena++ L2/L3 (~0–2%), ARC-AGI-3 (~13% best agent; LLMs <1%; humans 100%), TheAgentCompany (~30%), GDPval (~41% vs 50% parity), Vending-Bench (~17% of human), PaperBench (~24%), ScienceAgentBench (~33%) | Large, durable headroom; many gaps trivial for humans | The zone still worth optimizing against |
Two patterns stand out. The biggest gaps sit on the most economically realistic tasks, which says the frontier is furthest from done exactly where the money is. And interactive or contamination-proof designs resist saturation: ARC-AGI-3, which is interactive, and the K-Prize, which is post-cutoff, are the constructions that stay hard. The full table, with sources and caveats per row, lives in the tracker.
Part VI: What to do, depending on who you are
"Stop chasing the highest number" is good advice and useless on its own, because the three groups reading this need different things from a benchmark. The exam-era reflex fails each of them in a different way.
If you build agents
- Fix and disclose the harness. Your score is a model-plus-harness result; say which harness.
- Report a cost frontier, not a lone accuracy point. Cents-versus-dollars decides shipping.
- Run repeated trials. pass^k over many seeds, with confidence intervals, not a single lucky run.
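A cost frontier, as opposed to a lone accuracy point, can be computed as a Pareto frontier over (cost, accuracy) pairs. The sketch below is one straightforward way to do it, not the Holistic Agent Leaderboard's implementation; the agent names, costs, and accuracies are hypothetical.

```python
def pareto_frontier(agents):
    """agents: list of (name, cost_usd_per_task, accuracy).
    Returns the agents not dominated by a cheaper-or-equal, more accurate one,
    ordered from cheapest to most expensive."""
    frontier = []
    # Sort by cost ascending; on equal cost, put the more accurate agent first.
    for name, cost, acc in sorted(agents, key=lambda a: (a[1], -a[2])):
        # Keep an agent only if it beats every cheaper agent on accuracy.
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# Hypothetical agents on one benchmark.
agents = [
    ("baseline", 0.05, 0.41),
    ("retry-5x", 0.40, 0.52),
    ("elaborate", 12.00, 0.50),
    ("max-scaffold", 87.00, 0.58),
]

for name, cost, acc in pareto_frontier(agents):
    print(f"{name}: ${cost:.2f} per task, {acc:.0%}")
```

Here the "elaborate" agent drops off the frontier because a far cheaper one already beats it, which is the pattern a one-dimensional ranking cannot show.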
If you buy agents
- Ignore the generic leaderboard. Commission a small eval on your own tasks and data.
- Demand pass^k and a failure audit: what happens when it fails, and is that recoverable?
- Price per successful task, not per token or per benchmark point.
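Pricing per successful task is a one-line calculation once you have a success rate. This sketch assumes every attempt costs the same, attempts are independent, and failures are detected and retried at no extra cost; real deployments break all three assumptions to some degree. The agents and prices are hypothetical.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected spend to obtain one successful task, assuming independent
    retries until success (an illustrative simplification)."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Hypothetical agents: the cheaper attempt is not the cheaper outcome.
cheap_flaky = cost_per_success(0.50, 0.25)
pricey_reliable = cost_per_success(1.20, 0.80)

print(f"${cheap_flaky:.2f}")      # $2.00
print(f"${pricey_reliable:.2f}")  # $1.50
```

The agent that costs more than twice as much per attempt is the cheaper one per finished task, and this still leaves out the cost of cleaning up after failures, which the failure audit is meant to surface.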
If you govern agents
- Prioritize contamination-resistant and held-out benchmarks; treat static-board frontier scores skeptically.
- Fund reproducible, cost-controlled evaluation infrastructure rather than one-off claims.
- Make safety and autonomy evals first-class, now that capability thresholds are starting to gate releases.
A note against my own metaphor
The exam and the stopwatch are a teaching device, not the territory, and it is worth saying where the metaphor leaks. Not everything old was an exam: some static benchmarks always tested reasoning under interaction, and some "agentic" benchmarks are exams in a trench coat. Not everything new is real work: a long-horizon benchmark can be long and pointless, and dressing a task in tools and hours does not make it valuable. And a low score is not a badge of depth. A benchmark can report 2% because the task is profound or because the task is broken, and low performance does not automatically mean truth. Telling those two apart is the entire job. It is what a saturation band, a harness caveat, and a contamination flag are for.
Coda
So the encouraging version of this story, that we are finally measuring real work instead of trivia, is true but incomplete. A low number can mean the task is hard and honest, or it can mean the benchmark is badly built, and the only way to know is to look under the score. The next phase of AI evaluation will not be won by inventing harder leaderboards alone. It will be won by making the measurement stack itself auditable: the task, the harness, the cost, the variance across runs, the contamination risk, and the way the system fails. A benchmark that cannot tell you those six things is reporting a rumor, however many decimal places it carries. The standard I would ask of the whole field is the one I try to hold my own catalog to first.
Sources and method
Numbers come from the AI Agent Benchmarks Saturation Tracker (61 benchmarks, snapshot June 2026; open data at /data.json and /data.csv), cross-checked against primary sources:
- METR, Measuring AI Ability to Complete Long Tasks (time-horizon metric, arXiv 2503.14499) and How Does Time Horizon Vary Across Domains?
- S. Kapoor, A. Narayanan and colleagues, Establishing Best Practices for Building Rigorous Agentic Benchmarks (the ABC checklist, NeurIPS 2025, arXiv 2507.02825) and AI Agents That Matter (cost-controlled evaluation, arXiv 2407.01502), plus the Holistic Agent Leaderboard (HAL), Princeton.
- I. D. Raji and colleagues, AI and the Everything in the Whole Wide World Benchmark (construct validity, arXiv 2111.15366); Can We Trust AI Benchmarks? (EU JRC review, arXiv 2502.06559).
- Benchmark Data Contamination of Large Language Models: A Survey (arXiv 2406.04244).
- OpenAI, GDPval (arXiv 2510.04374).
- Benchmark primary sources: SWE-bench / Verified / Pro, OSWorld, WebArena, τ-bench (Sierra), the Konwinski Prize (Laude Institute), ARC-AGI (ARC Prize), Vending-Bench (Andon Labs), TheAgentCompany (CMU), and the meta-trackers Epoch AI and Artificial Analysis.