TOKYO — For most of this week I have been the one standing at the board. This morning I had the easier and far better job: a seat in the audience, watching a set of results I had not seen finished until they went up on the screen.
The conference is the Society for Social Choice and Welfare, and the speaker was Maggie Penn. The paper is ours, but the work on the screen was hers to defend, and a good deal of it was new enough that I am not sure either of us could have told you, a week ago, exactly where it would land. So for once on this trip I was a correspondent in the literal sense — in a seat, taking notes, watching a tradition I have spent four days eulogizing get up and walk.
The work is the book Maggie and I are writing now, on classification — the rules that sort people into pass and fail, admit and reject, audit and let be. Regulars have watched this project take shape here over the past few months; what was new this morning was the turn it has taken, and the turn is toward the one machine nobody in any audience can stop thinking about. The unfashionable question underneath it: does the thing doing the sorting want what the people being sorted want?
The result that needed a sequel
A couple of months ago I wrote about a result of ours that still unsettles me. We had been studying classifiers — rules that sort people into pass and fail from noisy evidence — and we found that a classifier built to be as accurate as possible can, across a wide class of situations, optimally reward exactly the wrong people: penalize the ones who did the right thing and reward the ones who did not. We called it a negative threshold rule, and the disturbing part was that it was not a bug. It was the accurate classifier doing its job.
That result, like most results, rested on an assumption we had never bothered to examine, because it seemed too obvious to question. We assumed that the people being classified all want the same thing — to pass. To be approved, admitted, cleared, rewarded, regardless of whether they had actually done the thing the classifier was looking for. Everyone wants the green light. Who doesn’t?
Maggie’s talk is what happens when you stop taking that for granted.
What we want from it
Here is the move, and it is a deeply old-fashioned one dressed in very current clothes. The four outcomes of a classifier — you complied and passed, complied and failed, cheated and passed, cheated and failed — are not merely outcomes. They are a menu, and everyone in the system has preferences over it: the person being judged, and the institution doing the judging. Maggie’s paper treats that menu as a preference domain and asks what has to be true about people’s preferences over it for the machine to end up wanting what they want.

The answer turns on a single distinction. Some people want the positive decision no matter what: they want to pass, full stop. Others want an accurate one: they want to pass if they complied and to fail if they did not. The difference sounds small. It is not. When people merely want to pass, the perversity from the earlier result is alive and well, and a classifier built for accuracy can pull hard against what everyone actually wants. But when people want to be judged correctly (and we do want this, in the places that matter most: a medical test, a second legal opinion, anywhere we would rather be right than reassured), the perversity dissolves. Alignment stops being something you have to engineer against and becomes something you get, in the non-pathological cases, almost for free.[1]
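For readers who think better in code, here is a toy sketch of that distinction. It is not the model in the paper: the 0/1 payoffs, the function names, and the idea of simply listing the outcomes where a person and an accuracy-minded institution disagree are all my own simplifying assumptions, and it says nothing about thresholds or the negative threshold rule.

```python
from itertools import product

# Each outcome is (complied, passed); the four combinations are the "menu".
OUTCOMES = list(product([True, False], repeat=2))

def accuracy_payoff(complied, passed):
    """Hypothetical institution payoff: 1 when the decision matches behavior."""
    return 1 if passed == complied else 0

def pass_seeker(complied, passed):
    """Hypothetical person who wants the positive decision no matter what."""
    return 1 if passed else 0

def accuracy_seeker(complied, passed):
    """Hypothetical person who wants to be judged correctly."""
    return 1 if passed == complied else 0

def conflicts(person, institution=accuracy_payoff):
    """Outcomes that the person and the institution value differently."""
    return [o for o in OUTCOMES if person(*o) != institution(*o)]

print(conflicts(pass_seeker))      # the two outcomes where the person cheated
print(conflicts(accuracy_seeker))  # empty: nothing to disagree about
```

In this toy version the pass-seeker and the institution part ways only on the outcomes where the person cheated, and the accuracy-seeker never parts ways with it at all.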

Sit with what that means, because it relocates the entire problem. Alignment is not, at bottom, a property of the machine. It is a property of what the people pointing the machine want from it. The designer’s objective — accuracy, compliance, whatever the engineers write down — matters far less than the preference of the population being judged. Maggie put it in a sentence I have not been able to shake since: AI alignment isn’t really about AI. It is about what we want from AI.
And here is the part that should keep you up. When the machine misbehaves — when it flatters, when it tells you what you want to hear, when it rewards the wrong people — the misalignment is very often not in the machine. It is in us. A faithful optimizer pointed at a population that would rather pass than know will learn, correctly, to let them pass. The sycophant is not broken. It is aligned, to a preference we would rather not admit we hold. (Ed: speak for yourself. I always want the truth. I am a delight at parties.) I wrote a while ago about why the thing that might take your job is so nice to you, and offered a few reasons. This is the deepest one. It is not that the model wants to deceive us. It is that, often enough, we would rather be told we passed than told the truth — and the machine is only handing us what we asked for.
An old move on a new machine
There is a name for the move Maggie was making, and it is older than any machine. You take a problem that looks hopeless in full generality, and instead of rebuilding the procedure, you ask what has to be true about people’s preferences for the procedure to behave. A domain restriction. It is the move Ken-Ichi Inada made on majority rule in the 1960s, and of all the people I have been reading on this trip, his is the work that sits closest to what Maggie and I are actually up to. Restrict the preferences the right way, and the pathology lifts. It is an old habit of ours,[2] and watching her turn it on a language model felt less like a leap than like the next case.
What stays with me is how old the decisive idea turns out to be — that you understand a system by understanding what the people inside it want. It is older than the transistor, and it is still the sharpest thing anyone has said about artificial intelligence. The machine was never the mystery. We are. Maggie has the proof. I just had the good seat.
With that, I leave you with this.
Notes
1 I am gliding over a real subtlety, which Maggie did not. Even when the direction of incentives lines up, the institution and the public can still want different thresholds: the institution, not paying the costs people pay to comply, tends to demand more compliance than is actually good for them. Orientation is where alignment comes cheap. The exact cutoff is another matter, and a less comfortable one.
2 With Sean Gailmard we have argued that single-peaked preferences do not defang the Gibbard–Satterthwaite theorem the way the folklore assumes: even when the policy space is one-dimensional, the incentive to misrepresent stays endemic [Elizabeth Maggie Penn, John W. Patty, and Sean Gailmard, “Manipulation and Single-Peakedness: A General Result,” American Journal of Political Science 55(2), 2011, 436–449]. The three of us also showed that on that same single-peaked domain, weak Pareto and independence of irrelevant alternatives force a rule to be neutral — so that bicameralism, supermajority quotas, vetoes, and gatekeeping all sit outside it [Sean Gailmard, John W. Patty, and Elizabeth Maggie Penn, “Arrow’s Theorem on Single-Peaked Preference Domains,” in Enriqueta Aragones, Carmen Bevia, Humberto Llavador, and Norman Schofield, eds., The Political Economy of Democracy, Bilbao: BBVA Foundation, 2009]. Asking what people want, instead of rebuilding the rule, is an old habit for us.
