Dispatches from the Underground, May 15, 2026

Six posts since the last Dispatch, four of them a single arc on AI as a classification system. The arc started with a Vice President who could not evaluate her engineers, passed through a benchmark that turned out to be measuring string length, paused on a paper about elite athletes that was Berkson’s paradox in …

Menus of Questions (Or, How Are LLMs Like Restaurants?)

Earlier today, while I was working with an LLM on something, it asked me a question. Here is the question. Three options. They differ from each other in ways the menu makes plain: inherit the body’s numbers and flag, correct everywhere, or correct the appendix and flag the body as wrong. The point of …

Twenty-Seven Characters

While similarly post-apocalyptic and “numbers-driven,” this post is not actually about a new Netflix series. Rather, the main character of the story is HaluEval, a new “standard benchmark” for measuring whether a large language model is hallucinating: producing fluent, plausible-sounding text where it ought to be reporting a fact. The benchmark contains around 35,000 …

Your AI Makes Bean Soup? Sure, But Mine Makes 7 Bean Soup!

A Vice President at a large technology company sits down at the end of the quarter to evaluate her engineering team. She has twelve direct reports. They have spent the quarter using AI tools heavily, as the company’s leadership has insisted they do. She must now write performance reviews. She has to say, in writing, …