From the Path: When Better Looks Worse

NANCY — A few days ago I promised a dispatch on the keynote, and here it is. The eighth International Conference in Philosophy and Economics met this week in Nancy, and the keynote was given by my closest collaborator, who is also my wife, Maggie Penn. Her title was “Three Paradoxes of Optimal Evaluation.” I sat in the back of the lecture hall and took notes like a correspondent (as the attendance, pictured below, suggests, Maggie gives a great talk). What follows is her argument; the work behind it is jointly ours, but the keynote was hers.

The keynote, from a correspondent’s seat.

Begin, though, not in the lecture hall but with a number. Somewhere in the machinery of the French state, an algorithm run by the national family-benefits fund scores millions of recipient households every month, assigning each a risk score, and the score decides who gets investigated. The detail worth holding onto is what the score is built to find: not fraud, but overpayments — sums paid out that should not have been. Overpayments are far more numerous than fraud and, unlike fraud, need no proof of intent; they are mostly unintentional errors, and they fall hardest on the poorest recipients, for whom the rules are hardest to navigate.¹ The system is now before the Conseil d’État, where a coalition of twenty-five organisations wants it banned; the agency, for its part, published the source code in January.

The public argument runs on two tracks. One is transparency: publish the code, and the system can be inspected. The other is bias: inspect the code, and you find it leaning on poverty and disability. Maggie’s keynote says, in a precise sense, that both tracks miss the harder problem. The familiar paradoxes of social science — Arrow’s theorem, the Ostrogorski paradox, Sen’s paradox of the Paretian liberal, the discursive dilemma, Simpson’s paradox — are aggregation paradoxes: a property that holds for every part fails for the whole.² Her three are a different family. They are optimization paradoxes: a rule pursues an objective at one level and, at another level, produces a violation of that very objective. Simpson’s paradox, it turns out, has been keeping company we don’t expect. And none of what follows is a claim about what the French algorithm does in fact; it is a claim about what optimization itself produces — which is exactly why neither auditing the code nor naming the bias closes the question.

Three ways the metric lies

Take the first paradox, which we currently dub (with a proper nod toward Men Without Hats) the Safety Paradox. An institution, call it a social planner, decides who gets access to something: a benefit, a licence, release before trial. There are, for simplicity, two types of people: safe and harmful. Granting access to a safe person is socially optimal; granting access to a harmful person is undesirable (for whatever reason). The planner cannot perfectly observe who is “safe”; instead, the planner sees a noisy score, with a higher score meaning that the person is more likely to be a safe type and a lower score meaning that the person is more likely to be a harmful type. The planner’s (socially optimal) strategy is simply to choose a threshold that maximizes welfare (in this setting, a “minimal score” for being granted access, like a minimum height requirement at an amusement park). The intuitive comparative static holds: face a safer population, and the institution can afford to relax the threshold. Now consider the number an auditor actually reads: the equilibrium harm rate, \(H^*(p)\), the share of those granted access who go on to cause harm. In real-world analogues, this share might represent the reoffense/recidivism rate, the fraud rate, the wrongful-admission rate, or countless other “measures of Type-I error.” In other words, \(H^*(p)\) is a theoretically observable quantity that auditors measure, politicians point to, and the public may respond to.

The paradox is that as the population gets safer (as \(p\), the share of safe people, rises) that harm rate can rise with it. Two effects are in play, and they pull opposite ways: one direct (it arises simply because \(p\) is rising), the other indirect (it arises because the system, here the social planner’s threshold, responds to the rise in \(p\)). The direct effect: there are fewer risky people in the population, which lowers the rate. The indirect effect: the welfare-maximizing planner, seeing a safer population, rationally relaxes its threshold and admits more people, including more of the risky few who remain, which raises the rate. On a real range of population safety, the indirect effect wins.³ The harm rate, then, is not a reading of the population. It is the joint product of the population and the rule’s response to it, and it can get worse precisely because the world got better. An audit that treats the harm rate as a measurement of the population is assuming the very thing it claims to check.
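The tug-of-war between the two effects can be made concrete with a small numerical sketch. Everything specific in it is an illustrative assumption rather than the keynote's own working example: scores lie in [0, 1], harmful types have uniformly distributed scores, safe types have a score density that is nearly flat at low scores and steep at high ones, and the planner's gain and loss (`BENEFIT`, `COST`) are both set to 1. The constants `FLOOR` and `POWER` and all function names are hypothetical.

```python
# Toy model: every number and functional form below is an illustrative assumption.
FLOOR = 0.5     # lowest value of the safe-to-harmful likelihood ratio
POWER = 6       # how sharply high scores separate the two types
BENEFIT = 1.0   # planner's gain from admitting a safe person
COST = 1.0      # planner's loss from admitting a harmful person


def likelihood_ratio(score: float) -> float:
    """Score density for safe types divided by that for harmful types.

    Harmful scores are uniform on [0, 1], so this is also the safe-type
    density. It integrates to 1 and rises with the score.
    """
    return FLOOR + (1 - FLOOR) * (POWER + 1) * score ** POWER


def optimal_threshold(p: float) -> float:
    """Welfare-maximizing minimal score when a share p (0 < p < 1) is safe.

    The planner admits a score s whenever
    p * BENEFIT * f_safe(s) >= (1 - p) * COST * f_harmful(s).
    """
    cutoff = (1 - p) * COST / (p * BENEFIT)
    if cutoff <= likelihood_ratio(0.0):
        return 0.0  # population safe enough to admit everyone
    if cutoff >= likelihood_ratio(1.0):
        return 1.0  # population risky enough to admit no one
    return ((cutoff - FLOOR) / ((1 - FLOOR) * (POWER + 1))) ** (1 / POWER)


def harm_rate(p: float, threshold: float) -> float:
    """Share of admitted people who are harmful, at a given threshold."""
    harmful_admitted = (1 - p) * (1 - threshold)
    safe_admitted = p * (FLOOR * (1 - threshold)
                         + (1 - FLOOR) * (1 - threshold ** (POWER + 1)))
    admitted = harmful_admitted + safe_admitted
    return harmful_admitted / admitted if admitted > 0 else float("nan")


# Compare a population with 64.5% safe people to one with 66.5%.
p_before, p_after = 0.645, 0.665
t_before = optimal_threshold(p_before)
t_after = optimal_threshold(p_after)

baseline = harm_rate(p_before, t_before)
direct_only = harm_rate(p_after, t_before)    # safer population, old threshold
with_response = harm_rate(p_after, t_after)   # safer population, relaxed threshold

print(f"threshold: {t_before:.3f} -> {t_after:.3f}")
print(f"harm rate, baseline:             {baseline:.3f}")
print(f"harm rate, direct effect only:   {direct_only:.3f}")
print(f"harm rate, with planner response: {with_response:.3f}")
```

In this toy setup the harm rate falls when only the population changes and rises above its starting value once the threshold adjusts. The reversal depends on the assumed shape of the score distributions; with other shapes (equal-variance normal scores, for instance) the harm rate can fall throughout, which is why the claim in the text is about a range of population safety rather than all of it.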

[Figure, three panels. Safety: a curve of the observed harm rate against population safety rises as the population gets safer. Accuracy: two groups with identical data are rewarded at opposite ends of the score. Predation: a more lenient rule produces more punishment. In each, a green element and a red element move in opposite directions.]

The three paradoxes, each a place where the world (green) and the number an auditor reads (red) move in opposite directions. The safety panel is computed from the keynote’s own working example; the other two use its worked figures.

The second paradox moves from a fixed population to one that responds: people decide whether to comply, anticipating the rule that will judge them. Take two groups identical in everything the data can see, with signals of exactly the same quality, differing only in how costly compliance is for their members. The rule that maximizes classification accuracy does not treat them alike. For one group it rewards high scores, the rule that intuition expects. For the other it can be optimal to reward the lowest scores (a rule that runs against the evidence) because accuracy improves when the rule drives a group’s behavior toward an extreme, where it becomes easier to sort.⁴ Two groups, identical information, and the pursuit of accuracy pushes them to opposite poles. No biased data and no biased designer are required to produce disparate treatment; optimizing for accuracy can manufacture it on its own.

The third paradox concerns intent. Call a designer more predatory if it places extra weight on outcomes that punish people — revenue from fines, the politics of looking tough. You would expect a more predatory rule to be harsher. It can be the reverse. A more predatory rule can be more lenient, setting a lower bar to pass, because leniency is bait: it draws more people into non-compliance, and more non-compliance means more punishment in total. Each person faces an easier test and a smaller expected penalty, and yet more of them are punished. Sharper still, the more predatory rule can be the one every person being evaluated would prefer in advance. Leniency tells you nothing about a designer’s intent, and the volume of punishment a rule produces tells you nothing about whether the people under it are better or worse off.

What the audit can’t see

Three seemingly distinct paradoxes, with a common source: a rule (such as an algorithm) that optimizes makes familiar outcomes misleading. More observed harm does not imply a more dangerous population. More punishment does not imply that people are worse off. Lower measured accuracy does not imply that people are treated less accurately. The aggregation paradoxes tell us that parts and wholes can disagree; these tell us something sharper, and less comfortable: that observations we treat as evidentially transparent are nothing of the kind.⁵

This series has circled measurement and auditing from its first dispatch: a map that was honest about what it could and could not encode, a tunnel whose capacity no one could agree on, a mathematician called in to audit a number an institution had abused. Nancy, as it happens, has mirrored spheres set about several of its squares, and each one returns a perfectly faithful image of the street bent into a shape the street does not have. That is the keynote in a single object. The French agency can publish every line of its code — it has — and the faithful record can still be a bent measure. The transparency question is answerable. The harder question is what the numbers we would hold the system accountable by actually measure, and the keynote’s answer is that they do not measure what we assume.

The keynote takes no side in the French case, and that restraint is the point of it. It hands the agency’s critics no villain, because the troubling patterns need no malice and no biased data, only optimization; and it hands the agency no exoneration, because optimizing carefully and publishing the result does not make the resulting numbers mean what its defenders need them to mean. It makes both of the moves the argument currently runs on insufficient. And it sharpens the question the last dispatch from Nancy left open: if order in collective life is imposed rather than found, the thing worth asking is whether it is imposed legitimately — and that is not a question you can answer with instruments you cannot trust.

That closes the first cycle of these dispatches — two from London, two from Nancy. Next month the desk moves to Tokyo, and then to Montreal.

With that, I leave you with this.

Notes

¹ The system is the risk-scoring tool run by France’s national family-benefits fund, the CNAF, and known as the DMDE; it scores millions of recipient households each month and selects which of them to investigate. By the agency’s own description it targets the households most at risk of having received an indu — an overpayment — rather than fraud as such. A coalition that has grown to twenty-five organisations, led by La Quadrature du Net, is asking the Conseil d’État to ban the system on data-protection and non-discrimination grounds; the CNAF published the source code of its current version in January 2026. The account here draws on the coalition’s filings, the French ombudsperson’s opinion to the court, and investigative reporting by Le Monde and Lighthouse Reports.

² The canonical aggregation paradoxes include Arrow’s theorem and the Condorcet cycle, the Anscombe and Ostrogorski paradoxes, Sen’s paradox of the Paretian liberal, the discursive dilemma, and Simpson’s paradox. In each, a property that holds at the level of the parts — individuals, subgroups, separate questions — fails to survive to the whole. This blog has taken up Simpson’s paradox before, most directly in Pick One from Three.

³ Formally, the derivative of the equilibrium harm rate with respect to \(p\) splits into a direct effect, which is negative — a safer population holds fewer risky people — and an indirect effect, which is positive — the welfare-maximizing planner relaxes its threshold in response, admitting more. On a non-empty interval of population safety the indirect effect dominates, and \(H^*(p)\) rises. The corollary the keynote draws is that \(H^*(p)\) is not a valid measurement of \(p\): an audit that reads one off the other assumes the answer it sets out to verify.
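One way to write the decomposition in this note, in notation that is an assumption of this sketch rather than the keynote's: let \(t\) be the planner's threshold, \(t^*(p)\) its welfare-maximizing value, and \(H(p,t)\) the harm rate among those admitted at threshold \(t\), so that \(H^*(p) = H(p, t^*(p))\). The chain rule then gives:

```latex
\[
\frac{dH^*}{dp}
  \;=\;
  \underbrace{\frac{\partial H}{\partial p}}_{\text{direct effect}\,<\,0}
  \;+\;
  \underbrace{\frac{\partial H}{\partial t}\cdot\frac{dt^*}{dp}}_{\text{indirect effect}\,>\,0}
\]
```

The indirect term is positive as a product of two negatives: a stricter threshold lowers the harm rate among those admitted (\(\partial H/\partial t < 0\), given that higher scores are more likely to be safe), and a safer population lowers the optimal threshold (\(dt^*/dp < 0\)). \(H^*(p)\) rises wherever the second term outweighs the first.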

⁴ The underlying result is that an accuracy-maximizing rule takes one of exactly two forms — rewarding the signals above a threshold, or rewarding the signals below it — and which of the two is optimal depends on a group’s cost of compliance rather than on the quality of its data. Two groups with signals of identical quality can therefore be driven to opposite extremes of behavior.

⁵ The three results are drawn from recent joint work by Maggie and me on optimal screening and classification, including a paper forthcoming in the American Journal of Political Science and the book on classification we are now completing.
