Twenty-Seven Characters

While similarly post-apocalyptic and “numbers-driven,” this post is not actually about a new Netflix series. Rather, the main character of the story is HaluEval, a new “standard benchmark” for measuring whether a large language model is hallucinating: producing fluent, plausible-sounding text where it ought to be reporting a fact. The benchmark contains around 35,000 … Read more

What the Dashboard Didn’t Show You (Or, “The Denominator Moved”)

Roosevelt Elementary started Year 1 of Elevate with 100 students. It ended the year with 400. The other two schools held roughly steady. The district grew from 600 students to 900, and the composition of that denominator shifted decisively toward the lowest-scoring school. That one fact explains the entire dashboard. Elevate worked. Every school’s average … Read more
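
The denominator shift can be checked with a few lines of arithmetic. The sketch below keeps the enrollment figures stated above (Roosevelt from 100 to 400, the district from 600 to 900) and assumes everything else: the other two schools are given the placeholder names School B and School C and held at exactly 250 students each, and all average scores are made-up numbers chosen so that every school improves. It illustrates the mechanism only and is not a reconstruction of the actual dashboard data.

```python
# Enrollment for Roosevelt and the district totals (600 -> 900) come from the post.
# Hypothetical: the other schools' names, their even 250/250 split, and ALL scores.
enrollment = {
    "Roosevelt": (100, 400),  # (start of Year 1, end of Year 1)
    "School B": (250, 250),
    "School C": (250, 250),
}

# Hypothetical average scores (before, after); each school goes up.
avg_score = {
    "Roosevelt": (50, 55),
    "School B": (80, 82),
    "School C": (80, 82),
}


def district_average(period):
    """Enrollment-weighted mean score across schools for period 0 or 1."""
    total_students = sum(counts[period] for counts in enrollment.values())
    total_points = sum(
        enrollment[school][period] * avg_score[school][period]
        for school in enrollment
    )
    return total_points / total_students


before = district_average(0)
after = district_average(1)
every_school_improved = all(new > old for old, new in avg_score.values())

print(f"Every school improved: {every_school_improved}")
print(f"District average before: {before:.1f}")
print(f"District average after:  {after:.1f}")
```

With these assumed numbers, each school's average rises while the district-wide average falls from 75.0 to 70.0, purely because the lowest-scoring school now makes up a much larger share of the denominator.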

Know When to Hold ‘Em (or, “what is AI?”)

There is a lot of noise about AI safety these days, and I want to contribute to it in a specific and, I hope, useful way. Maggie and I are spending this year at the Russell Sage Foundation working on, among other things, how to make our theoretical work on classifiers understandable to a broader … Read more