When a passenger jet goes down, the first thing investigators want is the box. The orange paint helps people find it. The real power is what happens after they do. The recorder holds the last hours of the airplane described in numbers: altitude, airspeed, heading, the position of every control surface, how much power the engines were making, when and how each of those changed. A modern crash leaves a debris field, a recorder, and a legal obligation to reconstruct exactly what failed. Aviation did not get safe by being careful. It got safe by being forensic. Failure was made to leave evidence, and the evidence was made to be read.
This is the same instinct behind most of AI’s progress, and almost nobody describes it that way. The flights that landed teach less than the ones that crashed. Progress comes from recovering the box: turning a vague sense that something is wrong into a record precise enough to engineer against. Make the failure legible, and improvement follows.
But there is a category of accident that every investigator dreads, because the box explains nothing. The aircraft was airworthy. The crew was in control. Altitude, airspeed, control inputs — all of it reads normal, all the way down. And the plane flew, under power and under command, straight into the ground. The category has a name: controlled flight into terrain. The recorder did its job perfectly. It recorded a healthy airplane being flown competently into a mountain, and nothing in the record says why.
Hold that case. AI has its own version.
I. Reading the Box
By late 2024, a question had become a small internet ritual: ask a frontier language model how many times the letter r appears in the word “strawberry.” It would answer two. The answer is three.
The failure became a meme. The joke mattered less than the shape of failure it exposed. The model’s mistake came from how it represents language: as sequences of tokens rather than sequences of characters. “Strawberry” enters the system as a few subword chunks, and the internal representation loses the letter-level structure a person holds effortlessly in working memory. The model could write a sonnet about strawberries and still miss their r’s. A small failure had compressed a large limitation — how these systems handle discrete symbolic structure under natural-language framing.
Before the failure was isolated, the limitation lived as a vague impression that models were “not really reasoning.” The impression was real but too blurry to engineer against. The miscount gave it an address. You could write a test. You could score it. You could watch the number move.
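What such a test might look like, as a minimal sketch: the word list and the stand-in answerer below are hypothetical, invented for illustration, and no real model is being called. The ground truth costs nothing, because counting characters in a string is exactly the operation a tokenized model lacks direct access to.

```python
def count_letter(word: str, letter: str) -> int:
    """Ground truth: count the characters directly, ignoring case."""
    return word.lower().count(letter.lower())


def score(cases, answer_fn) -> float:
    """Fraction of (word, letter) cases where answer_fn matches the ground truth."""
    correct = sum(
        1 for word, letter in cases if answer_fn(word, letter) == count_letter(word, letter)
    )
    return correct / len(cases)


# Hypothetical test cases, chosen only for illustration.
CASES = [
    ("strawberry", "r"),   # 3
    ("banana", "a"),       # 3
    ("mississippi", "s"),  # 4
    ("letter", "t"),       # 2
]


def always_two(word: str, letter: str) -> int:
    """Stand-in for a model that answers 2 every time (assumption, not a real model)."""
    return 2


print(score(CASES, always_two))  # 0.25: only the "letter" case is right
```

Swap in a real system for `always_two` and the vague impression becomes a number between zero and one.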
This is the pattern. When the ImageNet Large Scale Visual Recognition Challenge launched in 2010, the best system, trained on 1.2 million labeled images, misclassified more than a quarter of the held-out test set. The error was patterned: it exposed the limits of the hand-crafted feature approach that had dominated computer vision for a decade. Two years later, Krizhevsky, Sutskever, and Hinton entered a deep convolutional network that cut the top-five error rate to roughly fifteen percent. Without the benchmark, deep learning’s advantage would have stayed an intuition. With it, the advantage was a number, and a number is something a field can organize around.
GSM8K, a set of grade-school math word problems, exposed that language models could not carry state across reasoning steps; the response was chain-of-thought prompting — not a patch for arithmetic, but a rethinking of how models decompose problems. SWE-bench exposed a different split: models could write functions, yet struggled to navigate real codebases. The response was agentic architectures — systems that plan, search, and iterate rather than generating in a single pass.
Every useful benchmark is a fossil: a record of a model’s inadequacy at a specific moment. But the best fossils do more than document the failure. They change the design underneath. When models failed at letter-counting, the lesson was larger than the example. The failure pointed back toward representation, decomposition, and the mismatch between natural-language fluency and discrete symbolic structure. The benchmark was the surface. The redesign ran deeper. The failure left a record specific enough to build against, and the field built.
The honest posture toward AI has been the measurer’s, not the prophet’s: get close to the workflow and read the failure. That discipline carried AI this far.
But it assumes the failures will leave something to read. And the flaw in that assumption is structural.
II. The Measure Decides
When a benchmark becomes the optimization target, the system learns the test before it learns the capability the test was designed to measure. Models saturate GSM8K by pattern-matching against the distribution of grade-school word problems instead of reasoning about quantity. They clear SWE-bench by exploiting regularities in specific repositories while still struggling with unfamiliar code. Near-perfect scores on the test, persistent failure on the real task — this has happened so often the field now expects it. Any measure pressed hard enough as a target stops measuring the thing it was meant to.
The standard response is to build a harder test. This works for a while. GLUE, the language-understanding benchmark released in 2018, was meant to be a durable target; models surpassed its human baseline within about a year, and by mid-2019 the leaderboard had run out of room. So its authors built SuperGLUE, a deliberately harder successor that opened with a wide margin below human performance. That margin closed in under two years: by January 2021 the best models had crossed the SuperGLUE human baseline, with Microsoft’s DeBERTa topping the leaderboard at 90.3 against the human 89.8. The field treats this cycle as an arms race — benchmarks against models, each pushing the other forward — and to some degree it is. Harder tests have driven real capability gains.
But gaming is the shallow problem. Gaming still assumes the game exists. The harder case is the failure that never becomes a game at all. Look at what the method actually does with a failure. It finds the failure, turns it into a score, and improves against that score; the score is the only handle the method has. A failure that cannot be scored gives the method nothing to grip. The method has no procedure for this kind of case. It does not lose; it never gets traction. And such failures exist: some consequential failures resist every available score. Faced with those, the method registers nothing. The record comes back clean, and the field reads a clean record as the absence of a failure.
This is not a bug in any particular benchmark. It is a property of the method.
III. What the Box Cannot Record
A failure that leaves a trace on the record is the kind the method handles well. You find it, you reproduce it, you redesign around it. The strawberry miscount was that kind: visible, repeatable, scorable. So was every benchmark that moved the field. The defect announced itself, and a defect that announces itself can be engineered away.
Now consider a different kind.
A model produces a legal memorandum analyzing whether a contract clause is enforceable. The memo is fluent, well-structured, and cites real cases. It is also wrong at the level that matters: which line of precedent it treats as controlling. It reads a narrow exception as a general rule. Every local move is defensible. The error lives in the weighting, and the weighting is invisible unless you already know how the question should have come out.
Or an agent managing a software deployment. It completes the task correctly ninety-five times. On the ninety-sixth it applies a database migration against a stale assumption, watches every local precondition pass, and leaves the wrong service reading from the wrong shape of data. Every step it took was reasonable on its own. There was no single action a record would flag. It still flew, competently, into the ground.
Or the ordinary user case: a confident answer given to someone without the expertise to inspect it. The failure is not just that the answer is wrong. It is that the person who needed the answer has no instrument for seeing the wrongness. On the record, it is indistinguishable from a confident answer that happens to be right.
These failures share a structure. They blur into the successes. The subtly wrong memo is indistinguishable from the right one except for a judgment buried somewhere in the middle of it. The agent that failed on the ninety-sixth run looked exactly like the agent that succeeded on the first ninety-five. A benchmark can test whether a model gets the answer right. It cannot measure the texture of a failure: whether the failure is loud or quiet, whether it announces itself or hides inside something that looks like success.
The obvious objection is that this is temporary. Illegibility has a way of being provisional: the strawberry limitation was a vague impression that models weren’t really reasoning before the miscount gave it a scorable address, and the field has a long history of dragging the unmeasurable into the measurable. Wait, the argument goes, and someone will build the eval that scores whether the memo weighed the precedent correctly. But what that eval would require is the objection’s undoing. To score the memo, the grader must already know which line of precedent should have controlled, and supplying that is exactly the judgment whose absence was the failure. The strawberry test needed a dictionary. This test needs the answer.
IV. Where the Recorder Goes Dark
The recorder only captures what someone decided, in advance, was worth instrumenting: altitude, airspeed, control positions, engine power, the parameters an engineer chose because earlier crashes ran through them. The investigation, in turn, can only see what left physical evidence. The lethal variable in a controlled-flight-into-terrain case was never among those parameters: it was the crew’s mistaken model of where the ground was, a belief held confidently and contradicted by nothing the aircraft was doing. This is the case investigators dread. The box logs a healthy airplane. What brought the plane down was never a recordable quantity.
AI’s hardest failures have this shape. The output is language, code, analysis, a decision — artifacts whose quality lives in semantic content that cannot be inspected the way a control surface can. You can log every input and every output and still miss the only thing that failed, because what failed was the weighing, and weighing leaves no trace. Seeing the misweighted memo requires the expertise the memo was supposed to provide. The illegibility is built into the artifact. These systems produce outputs whose decisive failures often live in judgment, weighting, and context.
Fluency is not transparency. Coherence is not correctness. And neither gap leaves a mark the recorder can read.
Nobody worried about this when the failures were obvious — when models failed at counting letters, arithmetic, or navigating a file system. Those failures left a record. You could see them, score them, train against them, and the field did. The question is what happens as the obvious failures get fixed and the ones that remain are the quiet kind: failures disguised as success, reading normal all the way down.
Aviation became safe because the failures that killed people happened, mostly, to be the kind that leave evidence. The open question for AI is whether its most important failures are that kind at all — or whether the field has built the most sophisticated recorder in history for a category of crash that, by its nature, ends with the record reading normal.