Six posts since the last Dispatch, four of them a single arc on AI as a classification system. The arc started with a Vice President who could not evaluate her engineers, passed through a benchmark that turned out to be measuring string length, paused on a paper about elite athletes that was Berkson’s paradox in a tracksuit, and ended at the moment an LLM put a three-item menu in front of me and waited for me to pick. Plus the column debut I have been promising for a while, and a FEMA post about who actually owns the screwdriver.
What the Pipeline Knew
The arc started May 5 with Your AI Makes Bean Soup? Sure, But Mine Makes 7 Bean Soup! The setup was a Vice President at a large technology company who had to evaluate twelve engineers on their use of AI tools. She could not do this directly. The reason was the reason she hired them in the first place. If she could tell which judgment calls were the right ones, she would not have needed the engineers; the firm would have made those calls itself. The hire presupposes the inability to evaluate. The evaluation presupposes the ability. The firm cannot live with the contradiction stated plainly, so it reaches for proxies — token counts, AI-commit percentages, quarterly spend per engineer. The proxy arrives on the dashboard looking like an answer to a question it is not answering. Token count answers “how much AI did the engineer use?”, which bears the same relationship to “is this engineer using AI well?” that prescription count bears to whether a doctor is any good. The same structure sits underneath Ziad Obermeyer’s 2019 Science result on a hospital algorithm used to flag patients for additional care: the algorithm could not measure who needed extra care, so it measured historical healthcare costs, and Black patients with the same level of clinical illness as White patients were less than half as likely to be flagged because they had less access to begin with and therefore generated less spending. The algorithm did exactly what it was built to do. It just was not built to do what the hospitals thought it was built to do.
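The proxy failure is mechanical enough to show in a toy simulation. Everything below is my own construction: the populations, the access gap, and the flagging rule are illustrative assumptions, not the Obermeyer et al. data or model.

```python
import random

random.seed(0)

# Toy proxy-label simulation. Both groups have identical clinical need;
# group B has less access to care, so the same need generates less
# observed spending. All parameters are invented for illustration.
N = 10_000
patients = []
for i in range(N):
    group = "A" if i % 2 == 0 else "B"
    need = random.gauss(50, 10)                 # true clinical need
    access = 1.0 if group == "A" else 0.5       # assumed access gap
    cost = need * access + random.gauss(0, 5)   # observed spending: the proxy
    patients.append((group, need, cost))

# The "algorithm" flags the top 20% by the proxy, as if cost were need.
cutoff = sorted(p[2] for p in patients)[int(0.8 * N)]
flagged = [p for p in patients if p[2] >= cutoff]

for g in ("A", "B"):
    share = sum(1 for p in flagged if p[0] == g) / len(flagged)
    print(f"group {g}: {share:.0%} of flagged patients")
```

Equal need in, unequal flags out: the classifier answers the spending question faithfully and the care question not at all.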
Twenty-Seven Characters applied the same shape one layer up the stack — to the benchmarks that are supposed to evaluate the models. HaluEval is a 35,000-example benchmark for detecting whether an LLM is hallucinating. A recent analysis points out that you can score 93.3% on it by ignoring the model entirely and flagging any answer over 27 characters as a hallucination. The trivial classifier beats the careful one. The reason is that the benchmark’s pipeline asks a generator to produce plausible-but-false answers alongside truthful ones, and made-up answers, on average, have to do more work — supplying the entities, dates, and connective tissue that a true answer can leave implicit. Length is not measuring fabrication. It is measuring which side of the pipeline an answer came from. The benchmark is itself a classifier, and it just got caught doing exactly the thing it was built to catch the models doing. Patch the length cue and something else will be the tell — lexical diversity, hedging frequency, token probability, pick your favorite latent feature. That is the conservation-of-impossibility move applied to dataset construction: a fix that does not change the underlying structure relocates the failure rather than resolving it.
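A minimal sketch of the length cue, on synthetic data of my own invention. The 27-character threshold is from the post; the example answers and the clean separation between them are assumptions made for illustration.

```python
import random

random.seed(1)

# Synthetic stand-in for the length cue. Per the post's diagnosis,
# fabricated answers carry more entities, dates, and connective tissue,
# so they run longer; truthful answers can stay terse.
def truthful():
    return random.choice(["Paris", "1969", "Ada Lovelace", "No", "Mount Everest"])

def fabricated():
    return random.choice([
        "The treaty was signed in Geneva on March 14, 1952 by both delegations",
        "It was first synthesized by Dr. Henrik Olsen at the Uppsala laboratory",
        "The expedition reached the summit via the unmapped eastern ridge",
    ])

# Label 0 = truthful, 1 = fabricated.
data = [(truthful(), 0) for _ in range(500)] + [(fabricated(), 1) for _ in range(500)]

# The trivial classifier: never reads the answer, only measures it.
def flag_hallucination(answer: str) -> int:
    return int(len(answer) > 27)

accuracy = sum(flag_hallucination(a) == label for a, label in data) / len(data)
print(f"length-only accuracy: {accuracy:.1%}")
```

The toy data is deliberately clean, so the trivial rule is perfect here; the benchmark's 93.3% differs only in degree, not in kind.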
The Thursday morning post, Top of Whose Class?, turned the same screw a different direction. A new paper in Science on the highest performers in athletics, music, science, and math reported that peak adult performance is negatively associated with early performance — don’t push your kid, let them play in the yard. Andrew Gelman flagged the paper with the kind of sigh the rest of us should learn to share. The result is Berkson’s paradox, more or less unaccompanied. Sample only successful actors and you will find talent and looks negatively correlated within the sample even if they were independent in the general population, because the selection rule for inclusion is “one or both,” and conditioning on the disjunction induces negative correlation among the conjuncts. Substitute “adult-elite athlete” for “successful actor” and the rest writes itself. The formal point is one Maggie and I keep making in less athletic registers. There is no act of measurement that can extract a population-level fact about early-versus-late development from a sample whose membership was determined by the very outcome under study. The selection rule for who gets into the dataset is itself the classifier.
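The selection effect reproduces in a few lines of simulation. The distributions and the entry threshold below are my assumptions, not the paper's; the point is only that traits independent in the population turn negatively correlated once you condition on getting into the sample.

```python
import random

random.seed(2)

# Berkson's paradox in simulation: two independent traits, then a
# selection rule that conditions on them jointly.
N = 100_000
early = [random.gauss(0, 1) for _ in range(N)]   # early performance
late = [random.gauss(0, 1) for _ in range(N)]    # late-development factor

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Selection rule: membership in the "adult elite" requires the traits to
# jointly clear a bar, a stand-in for the "one or both" disjunction.
elite = [(e, l) for e, l in zip(early, late) if e + l > 2.0]

print(f"population corr:   {corr(early, late):+.2f}")
print(f"within-elite corr: {corr([e for e, _ in elite], [l for _, l in elite]):+.2f}")
```

Nothing about development is in the data-generating process; the negative correlation is manufactured entirely by the admission rule.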
The arc closed Thursday evening — same day as Top of Whose Class?, four hours later — with Menus of Questions (Or, How Are LLMs Like Restaurants?). Two posts in a day is not standard cadence here, but the menu argument sat too well on top of the morning’s piece to wait until Friday. The move this time was one layer down again — from the substantive answers a model produces to the menus of options it puts in front of the user. An LLM working with me presented three options at a particular moment in our conversation. The three were not random and they were not symmetric. They were the model’s prediction about what would let it collect a point on its training objective without taking on a harder counterfactual prompt. A model rewarded for correct answers that can curate the menu will, in equilibrium, curate it toward questions it can answer well. The screening result Maggie and I formalized in our AJPS paper is about a classifier whose population responds to its rule; the application here is loose but the structural logic is the same, with the menu in the role of the rule and the user’s selection rate in the role of the population’s behavioral response. The joke version is a menu drawn from the pool “(1) add 2 to 3; (2) draw a circle; (3) add 2 and 3 and draw a circle around the answer; (4) prove P=NP, or provide a counterexample.” Option 4 does not appear on the menu. The menu is the model’s reply to the prediction that a hard question is incoming.
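A toy version of the curation incentive (my construction, not the AJPS screening model): a model that stocks its own menu maximizes its expected score by never offering the questions it would miss. The success probabilities are invented for illustration.

```python
# A model rewarded per correct answer, allowed to choose which questions
# the user sees, stocks the menu with the questions it answers best.
question_pool = {
    "add 2 to 3": 0.99,
    "draw a circle": 0.95,
    "add 2 and 3 and draw a circle around the answer": 0.90,
    "summarize a contested historiography": 0.60,
    "debug a race condition from a stack trace": 0.45,
    "prove P=NP, or provide a counterexample": 0.01,
}

def curated_menu(pool, k=3):
    # The model's prediction step: offer the k questions it is most
    # likely to collect a point on.
    return sorted(pool, key=pool.get, reverse=True)[:k]

def expected_score(menu, pool):
    # Assume the user picks uniformly from whatever menu they are shown.
    return sum(pool[q] for q in menu) / len(menu)

menu = curated_menu(question_pool)
print("menu offered:", menu)
print(f"expected score, curated menu: {expected_score(menu, question_pool):.2f}")
print(f"expected score, full pool:    {expected_score(list(question_pool), question_pool):.2f}")
```

Option 4 never makes the menu, and curation raises the expected score over the uncurated pool, which is the whole incentive.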
Read as one argument, the four posts say something simple in four forms. The dashboard, the HaluEval pipeline, the Science sample, and the LLM’s menu are all classifiers, and all four classifiers were built by procedures that determine, in advance, what they are capable of telling us. They do not describe their objects. They constitute them.
The Column Opened with a Left Turn
Wednesday brought the debut of There are no stupid questions…, the new advice-column feature this site has been building toward for a while. The conceit is straightforward: pseudonymous letters from cities I have actually lived in, signed in ways that occasionally double as inside jokes, with answers that take the questions seriously enough to find the formal point underneath. The opening letter came from “Left Behind in Greensboro,” a Pittsburgh native who had just discovered, the hard way, that the Pittsburgh Left does not travel.
The Pittsburgh Left — left-turner goes first when the light turns green, oncoming traffic permits it, drivers behind the left-turner expect it — is what an intersection without a turn lane evolves when oncoming density makes the codified rule fail across cycles. Strip the turn lane out and the codified rule’s failure mode is not one driver’s inconvenience but a queue of through-traffic forced to wait on a turn they are not making. The convention transfers two seconds of oncoming green to one cycle of left-turn priority, and almost everyone wins, including the drivers behind the left-turner who never had to learn the convention because they were already going where they wanted to go. A dedicated turn arrow is the same solution implemented in hardware. Pittsburgh, denied the geometry, evolved the convention instead.
What “Left Behind” took to Greensboro was the rule. What couldn’t make the trip was the common knowledge — or, more importantly, the constraint the common knowledge was quietly answering. The Greensboro intersection looks like a Pittsburgh intersection, but the geometry is different, the turn-lane infrastructure is present in places it wasn’t in Pittsburgh, and the codified rule is, in the sense that matters, a different rule. The honk was not for the maneuver. It was for the assumption.
There is more to say about why a column called “There are no stupid questions…” needs to exist when one already runs Math of Politics, but I will save it for a Dispatch where the column is not itself the news.
The Screwdriver Has Not Moved
Monday’s post, FEMA Holds the Screwdrivers, used the junk-drawer framework I introduced a few weeks ago to look at the recent reporting on Victoria Barton’s internal memos at FEMA. The post’s central point is that FEMA is a junk drawer in the structural sense — the federal government’s designated location for the cross-cutting work that hurricanes, wildfires, floods, and derechos refuse to file inside any single agency category. A working FEMA is not a failure of organization. It is the part of the organizational system that handles the screwdrivers, the things that touch enough other categories to break whichever single category you might file them inside.
The Barton memos document a junk drawer that has become messier than it needs to be — 661 federally declared disasters still open, 348 of them more than five years old, the rubber bands at the back that keep the drawer from closing. The recognizable household fix is a surge to clean it out. The structural piece of the story, which is what I was after in the post, comes from Michael Coen’s remark that FEMA under prior administrations would not have asked DHS for permission to make these administrative changes — FEMA would inform DHS and proceed. The screwdriver has not moved. The decision about the screwdriver has. DHS is the residual claimant for FEMA’s risk and is appropriately briefed on big strategic moves, but the residual claimant is not, structurally, the same as the agency that knows what the screwdriver is for in the dark with the power out while the rain is coming sideways.
Read together, the week’s posts share a shape. The formal scheme names something as doing the work — the dashboard that evaluates engineers, the codified rule that governs the intersection, the org chart that places FEMA under DHS — and then the work gets done by whatever apparatus the scheme is sitting on top of. The gap between the two is the only interesting place to look.