Getting the Song Backwards (Or, AI := -1 * Goodhart * Bayes)

Naked Capitalism linked to Tuesday’s post, which brought some welcome new readers to this blog. For those arriving fresh: welcome, and this post is a good place to start — it’s the second in a series, but it’s written to stand on its own.

In the last post, I wrote about work that Maggie and I have been doing on the basic architecture of a classifier: evidence comes in, a score comes out, a threshold converts the score into a decision, and someone bears the cost of being classified wrong. We toured seven of them — umbrellas to bail hearings — and ended on what we called a positive threshold rule. The logic is as intuitive as folk wisdom: act when the evidence is strong enough, not before. Know when to hold ’em. The threshold runs with the evidence.

I also made a promise. Under certain conditions, I said, an accuracy-maximizing classifier will violate this rule — not by accident, not because of a bug, but as the optimal solution to the objective the designer specified. It will say yes when the evidence is weakest and no when the evidence is strongest. We call this a negative threshold rule, and we said it was the right formal lens for understanding why AI systems hallucinate.

Today I make good on that promise. But first, I want to show you that negative threshold rules are not exotic. They are already built into systems you interact with every day, often by design, sometimes by accident, always with consequences for the people being classified.

When the System Rewards Low Scores

Consider a physical education class that grades not on absolute performance but on improvement between two tests. The score is your gain from test one to test two. The threshold is whatever gain the teacher has decided constitutes a passing grade.

Notice what this classifier rewards. A student who performs well on the first test has little room to improve; the threshold for a good grade is now, for practical purposes, almost out of reach. A student who performs poorly on the first test — whether through genuine difficulty or strategic sandbagging — has built in a great deal of room to “improve.” The classifier, designed to measure growth, has created a negative threshold rule with respect to first-test performance: a lower initial score raises the probability of being classified as a success. The rational response to this structure, if you know about it before the first test, is to underperform on purpose. The designer wanted to measure learning. The classifier rewards mimicking it.¹
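The incentive can be made concrete with a small sketch. It assumes, purely for illustration, that the second-test score is the student's true ability plus normally distributed noise; the function name and every number (an ability of 70, a required gain of 10 points) are hypothetical rather than drawn from any real rubric.

```python
import math

def pass_probability(first_score, ability=70.0, noise_sd=5.0, required_gain=10.0):
    """Chance that (second score - first score) meets the required gain.

    Assumption: the second score is Normal(ability, noise_sd), whatever
    the student did on the first test.
    """
    # The student passes when the second score is at least first_score + required_gain.
    z = (first_score + required_gain - ability) / noise_sd
    # Upper tail of the standard normal, written with the error function.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

honest = pass_probability(first_score=70.0)      # tries on test one
sandbagged = pass_probability(first_score=50.0)  # underperforms on purpose

print(f"honest first test:     {honest:.3f}")
print(f"sandbagged first test: {sandbagged:.3f}")
```

Under these assumptions the same student passes far more often after sandbagging, because the pass probability falls as the first score rises.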

This is not a hypothetical. It is the formal structure underlying every grading, incentive, or assistance scheme that measures relative rather than absolute outcomes — and there are more of these than you might think.

Defense attorneys understand it intuitively. There is a well-known and entirely rational practice of a lawyer telling a client, early in a case, to stop talking: not just to the other side, but to the lawyer herself. “Don’t tell me anything more” is not obstruction; it is information management. The attorney’s obligation runs to the client’s legal defense, not to the truth. What the lawyer knows can constrain what she can argue. What she doesn’t know cannot. Keeping the posterior low, staying below the threshold of “fully informed,” is the optimal strategy under a classifier (the court system) in which knowledge of guilt, once acquired, becomes a liability.²

The Medicaid eligibility rules for home health care workers in many states encode a negative threshold rule directly into statute. A worker who earns below a certain income threshold qualifies for Medicaid coverage. A worker who earns above it — by working more hours, taking on an additional client, accepting a modest raise — can lose coverage whose value exceeds the income gained. The classifier is designed to identify need. What it actually does, at the margin, is penalize the evidence of sufficiency. Work harder: lose coverage. The worker’s rational response is to stay below the threshold, which means working less than she otherwise would.³ The designer wanted to provide a safety net. The classifier has built a ceiling.
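A minimal sketch of the cliff, using made-up numbers: the income limit and the dollar value of coverage below are illustrative assumptions, not figures from any state's rules, and eligibility is simplified to a single cutoff.

```python
def total_resources(earnings, income_limit=20_000, coverage_value=6_000):
    """Earnings plus the value of coverage, which is lost above the limit."""
    eligible = earnings <= income_limit  # simplified, hypothetical eligibility test
    return earnings + (coverage_value if eligible else 0)

at_limit = total_resources(20_000)     # keeps coverage
after_raise = total_resources(22_000)  # earns more, loses coverage

# Earnings must exceed limit + coverage value before the worker is better off.
break_even = 20_000 + 6_000

print(at_limit, after_raise, break_even)
```

With these hypothetical figures, any raise that leaves earnings between the limit and the break-even point makes the worker worse off overall.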

The same structure recurs across means-tested programs generally: asset limits for food assistance, income cliffs for housing vouchers, resource thresholds for long-term care. Each of these is a classifier that produces a positive decision when the evidence of need is strong — and withdraws it, abruptly, when the evidence weakens past a cutoff. The intent is to target resources at those who need them most. The effect, at the margin, is a negative threshold with respect to self-sufficiency: demonstrating that you are doing better is the event that triggers reclassification as ineligible. There are genuine policy reasons for this structure, and serious people defend them. But the incentive it creates for the people who truly need the safety net is probably not the one the designer intended.

The most consequential contemporary example may be the chilling effect of immigration enforcement on immigrant communities’ use of public services — hospitals, schools, police, courts. As I wrote about in “The IRS Is Here to Help. So Is ICE.”, the optimal response for an undocumented person living under aggressive enforcement is not to avoid doing anything wrong. It is to avoid being seen doing anything at all. Every interaction with an institution — even a beneficial one, even one the person is legally entitled to use — raises the probability of detection. The threshold is not “have you committed a crime.” The threshold is “have you generated a data point.” The negative threshold rule runs with respect to visibility: the less evidence you leave of your presence, the safer you are. The cost falls not on the person who is actually a threat to public safety but on the person who merely exists in proximity to a system designed to find threats.⁴

When Accuracy Runs Backward

The examples above share a common structure: a designer set up a classifier in which, for a specific population at a specific threshold, the score runs in the wrong direction. In each case, you can point to the design decision that created it — the improvement-based grading rubric, the income cutoff, the data-sharing agreement. These are fixable, in principle. You could change the grading scheme, smooth the income cliff, build a firewall between tax records and immigration enforcement. The negative threshold is an artifact of the design, not a property of classification itself.

What Maggie and I show in our formal work is something harder to fix: there is a class of situations in which a negative threshold rule is not the result of bad design but the optimal solution to the designer’s stated objective.⁵ The designer wanted an accurate classifier. The accurate classifier has a negative threshold. These are not in conflict. Understanding why requires sitting with a feature of classification that the gym teacher, the Medicaid statute, and the ICE data-sharing agreement all share but none of them fully reckons with: the classifier shapes the behavior it is trying to classify.

Think about what it means for a classifier to be accurate. Accuracy is measured against outcomes — did the person classified as a speeder actually speed? Did the person classified as a fraud risk actually commit fraud? But those outcomes are not fixed. People respond to the classifier. A city that tickets indiscriminately gives drivers no reason to slow down; a city that tickets only speeders gives everyone a reason to drive safely. The classifier does not just observe behavior — it shapes it. And a designer who wants to be accurate has to take that feedback loop into account, because the behavior the classifier is trying to predict is the same behavior the classifier is producing.

Before going further, it is worth naming the baseline. When behavior does not respond to the classifier (when the people being classified act the same way regardless of what rule the designer uses), the optimal classifier is always a positive threshold rule. This is the world most statistical and machine learning practice implicitly assumes: the data-generating process is fixed, the designer’s job is to learn it accurately, and rewarding the positive signal and penalizing the negative one is the right call. The positive threshold rule is not a convention or a habit. It is the correct answer to a well-posed problem. And where behavior does respond, it also shifts the base rate: by encouraging compliance, it drives prevalence upward, which can itself improve accuracy when compliers are harder to identify. But the static case is the world the folk wisdom describes. Know when to hold ’em.

This is worth pausing on. Bayes’s rule, the standard prescription for rational belief updating, implies a positive threshold rule whenever the underlying population is fixed. If the prior and the signal are well-specified and the world doesn’t change in response to your decisions, the Bayesian classifier always rewards the stronger signal and penalizes the weaker one. There is no version of Bayes’s theorem that recommends penalizing the positive signal to achieve accuracy. A negative threshold rule is not just unconventional. In a static world, it is irrational. The result Maggie and I establish is that performativity, the dependence of behavior on the classifier, is precisely the condition that breaks this. When the world responds to the rule, the Bayesian answer is no longer guaranteed to be optimal, and the optimal answer can be its exact inverse.
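The static claim can be checked numerically in the simplest case. The sketch below assumes a binary signal that is correct with probability `signal_accuracy` (above one-half) and a fixed prior; the function name and the grid of values are illustrative choices.

```python
def posterior(prior, signal_accuracy, signal_positive):
    """P(compliant | signal) by Bayes's rule, for a fixed population."""
    if signal_positive:
        like_compliant, like_not = signal_accuracy, 1 - signal_accuracy
    else:
        like_compliant, like_not = 1 - signal_accuracy, signal_accuracy
    joint_compliant = prior * like_compliant
    return joint_compliant / (joint_compliant + (1 - prior) * like_not)

priors = [0.05, 0.25, 0.5, 0.75, 0.95]
accuracies = [0.55, 0.7, 0.9]

# With an informative signal, the positive signal always yields the higher
# posterior, so any cutoff on the posterior favors the positive signal.
always_ordered = all(
    posterior(p, q, True) > posterior(p, q, False)
    for p in priors
    for q in accuracies
)
print(always_ordered)
```

Whatever the prior, the ordering of the two posteriors never flips in this fixed-population setting, which is why a cutoff on the posterior can only ever run in the positive direction.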

The problem arises when behavior does respond to the classifier. Both positive and negative threshold rules shift the base rate — they just push in opposite directions. A positive threshold rule rewards the signal of compliance and discourages non-compliance, driving prevalence up. A negative threshold rule penalizes the signal of compliance and rewards non-compliance, driving prevalence down. Neither is inherently more “base-rate-shifting” than the other. What differs is the direction, and the direction that produces higher accuracy depends on where the prior sits and how strong the signal is.

Now the negative threshold becomes less mysterious — and more unsettling. Consider a population in which compliance is costly — where most people, if left to their own devices, will not comply, and the signal the classifier receives is accordingly skewed heavily toward non-compliance. A designer who wants to maximize accuracy in this population faces a stark arithmetic: if almost nobody complies, then a classifier that classifies almost everybody as non-compliant will be accurate almost all of the time. The optimal accuracy-maximizing algorithm in this setting can be exactly this perverse: penalize the people who send the positive signal (the rare compliers), reward the people who send the negative signal (the majority). Classify backward. Achieve accuracy by predicting the base rate rather than the individual.

But here is what makes the result genuinely greedy rather than merely ironic. The negative threshold rule doesn’t just identify a low-compliance population — it induces the members to comply even less. By penalizing compliance, the rule discourages it further, driving prevalence down. And as prevalence falls, predicting non-compliance for almost everyone becomes even more accurate. The rule is self-vindicating: the more aggressively it discourages compliance, the more accurate it looks. An accuracy-maximizing designer who commits to a negative threshold rule is not making a one-time miscalculation. She is locking in a feedback loop in which her measure of success improves precisely as the population she was supposed to serve gets worse.
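The loop can be sketched in a toy model. Everything in it is an assumption made for illustration and not the formal model referenced in this post: the classifier says yes with probability `delta0` after a negative signal and `delta1` after a positive one, a yes is worth `reward` to the person classified, the signal is correct with probability `q`, and each person's net cost of complying is drawn uniformly between `cost_low` and `cost_high`.

```python
def prevalence(delta0, delta1, q, reward, cost_low, cost_high):
    """Share of people who comply under the rule (delta0, delta1)."""
    # Complying raises the chance of a yes by (2q - 1) * (delta1 - delta0).
    gain = reward * (2 * q - 1) * (delta1 - delta0)
    # People comply when their cost is below the gain; costs are uniform.
    share = (gain - cost_low) / (cost_high - cost_low)
    return min(1.0, max(0.0, share))

def accuracy(delta0, delta1, q, pi):
    """Probability that the decision matches behavior at prevalence pi."""
    yes_if_comply = q * delta1 + (1 - q) * delta0
    yes_if_not = (1 - q) * delta1 + q * delta0
    return pi * yes_if_comply + (1 - pi) * (1 - yes_if_not)

q, reward, cost_low, cost_high = 0.9, 10.0, -0.3, 0.7

results = []
for delta0 in [0.0, 0.01, 0.02, 0.03, 0.04]:
    # delta1 = 0: a positive signal never earns a yes, so the rule tilts negative.
    pi = prevalence(delta0, 0.0, q, reward, cost_low, cost_high)
    results.append((delta0, pi, accuracy(delta0, 0.0, q, pi)))

for delta0, pi, acc in results:
    print(f"tilt={delta0:.2f}  prevalence={pi:.2f}  accuracy={acc:.3f}")
```

In this toy setup, each extra bit of negative tilt lowers compliance and raises measured accuracy until compliance hits zero; past that point, further tilt only adds errors.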

It helps to see why the designer would want this. The hardest case for an accuracy-motivated classifier is the one where the prior gives you nothing — where compliance and non-compliance are equally likely before any signal arrives. Think of the hardest kind of multiple-choice question: the one where all answers are equally plausible. When the prior is flat, the signal determines everything, and a weak signal leaves you nearly blind. A designer who can commit to a classifier in advance has a way to escape this: use the classifier to shift behavior, and let the shifted base rate do the work the signal cannot. A positive threshold rule escapes by pushing prevalence up; a negative threshold rule escapes by pushing it down. Which direction is better depends on the signal quality and the cost structure. The designer doesn’t solve the hard problem. She escapes it by reshaping the population until the problem is no longer hard — and sometimes the reshape runs backward.

The threshold flips. The rule runs backward. And — this is the part that should genuinely disturb you — it does so not because the designer made a mistake but because the designer was trying to do the right thing. Accuracy, pursued single-mindedly in a world where behavior responds to the classifier, produces a classifier that has given up on the people it was supposedly designed to reach. The designer optimally misclassifies.

Here is an analogy I find compelling, though I want to be clear it is an analogy rather than a derivation. Suppose a classifier wants to maximize accuracy and is given its choice of two yes-no questions to be evaluated on, with signal accuracy held constant between them. It will prefer the question whose prior probability is farthest from one-half — because accuracy is easiest when the base rate is extreme. When almost everyone is in one category, predicting that category is already highly accurate before the signal does any work at all. The signal is most valuable, and most necessary, precisely when the prior is flattest. This is, I think, the right formal lens for understanding why large language models hallucinate. A model whose training has implicitly optimized accuracy-like objectives will have learned to be most confident on questions where its training distribution was most skewed — where the effective prior was farthest from one-half. The failure mode is that the model cannot distinguish “this question has a skewed prior because the answer is genuinely clear” from “this question has a skewed prior because my training data was unrepresentative.”⁶ Confident output on thin-signal questions follows. The negative threshold rule is not a bug. It is, by analogy, the solution to the problem the designer actually posed.⁷
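The preference for extreme priors can be shown with a few lines of arithmetic. The sketch assumes a binary question, a signal that is correct with probability `q` regardless of the answer, and a classifier free to pick the best decision after each signal; `best_accuracy` is an illustrative name.

```python
def best_accuracy(prior, q):
    """Highest achievable accuracy with prior P(yes) and signal accuracy q."""
    # After each signal, guess whichever answer carries more joint probability.
    after_positive = max(prior * q, (1 - prior) * (1 - q))
    after_negative = max(prior * (1 - q), (1 - prior) * q)
    return after_positive + after_negative

q = 0.7
for prior in [0.5, 0.7, 0.9, 0.99]:
    with_signal = best_accuracy(prior, q)
    base_rate_only = max(prior, 1 - prior)
    print(f"prior={prior:.2f}  accuracy={with_signal:.2f}  "
          f"gain from signal={with_signal - base_rate_only:.2f}")
```

With signal accuracy held at 0.7, the flat prior is where the signal contributes most, and the extreme priors are where accuracy is highest even though the signal adds nothing.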

Know When to Fold ’Em (Reprise)

I have used Kenny Rogers as a closer at least twice already on this blog — the “Gambler” as a model of rational threshold behavior under uncertainty — so it is perhaps fitting that this post ends at his grave.

Kenny Rogers is buried at Oakland Cemetery in Atlanta, in a beautiful mausoleum not far from Bobby Jones. (Ed: “You buried the lede.” — I did not plan that. “Sure you didn’t.”) Oakland Cemetery is one of the great Atlanta institutions: founded in 1850, spatially organized by the hierarchies of its era — Confederate section, Jewish section, paupers’ field, politicians’ row — and home to mayors, civic figures, and at least one country music legend. It classifies its residents permanently, by a score assigned at death, against a threshold set by a designer who is also, eventually, in the dataset.

Life is a negative threshold rule with respect to age. The probability of the terminal classification — positive — increases monotonically as you accumulate evidence of having lived. There is no threshold you can stay below forever. The Gambler himself couldn’t hold ’em indefinitely, and Oakland Cemetery is where he rests.

With that, I leave you with this.⁸

¹ No Child Left Behind’s Adequate Yearly Progress structure is a rough but instructive real-world analog. Under NCLB’s “safe harbor” provision, a school could avoid failing AYP by reducing its share of below-proficient students by 10% from the prior year — an improvement-based threshold. The negative threshold property follows immediately: a school with 80% of students below proficiency needed only an 8-point absolute reduction to claim safe harbor; a school with 20% below proficiency needed only a 2-point reduction — but those 2 points were the hardest to find, because the remaining below-proficient students were the furthest from the cutoff. The improvement measure, designed to reward progress, quietly rewarded low baselines. This blog visited the paradox embedded in AYP measurement in a 2012 post; the structure has not changed.
