Know When to Hold ‘Em (or, “what is AI?”)

There is a lot of noise about AI safety these days, and I want to contribute to it in a specific and, I hope, useful way. Maggie and I are spending this year at the Russell Sage Foundation working on, among other things, how to make our theoretical work on classifiers understandable to a broader audience — policymakers, journalists, students, people who use AI every day and would like to know something true about how it works. This post is the first in a series that builds toward a result we find genuinely unsettling. We are starting with the foundations, because the unsettling part only lands if the foundations are solid.1


What Is a Classifier?

This morning, before you left the house, you looked out the window and decided whether to bring an umbrella. You had some evidence — the color of the sky, a weather app, the fact that it rained yesterday — and you made a call: bring it or don’t. That is classification. You sorted yourself into one of two categories (“person who brings an umbrella today” or “person who doesn’t”), based on evidence about the state of the world.

Notice something about that decision: because you had already decided the question was “will it rain,” you were looking for an umbrella, not checking your sunglasses prescription. The question you were asking determined what counted as evidence, what you were sorting for, and what action you’d take at the end. That is what every classifier does: it commits to a question, gathers evidence bearing on that question, and acts on the answer.

A classifier is a rule for sorting things into categories. That’s the whole definition. Your spam filter is a classifier — it sorts emails into “spam” and “not spam.” A doctor reading an X-ray is a classifier. A loan officer reviewing a mortgage application is a classifier. A hiring committee deciding which candidates advance to the interview stage is a classifier. A judge deciding whether to grant bail is a classifier. When a parole board decides whether an inmate is ready for release, that’s classification. You have been a classifier every time you have decided whether to trust a review on Yelp, whether to take a call from an unknown number, or whether to go back inside for an umbrella.

Here is the first thing that sounds obvious once you say it out loud, and that matters enormously once you believe it: modern AI systems are classifiers. When a large language model generates text, it is at each step producing a probability distribution over possible next words (strictly, tokens) and committing to one of them: in the simplest case the one it ranks highest, and often a draw from among the top-ranked candidates. That’s classification. Any system that maps inputs to outputs is, in a formal sense, a classifier: it sorts a space of possibilities and commits to one region of it. Your AI assistant is doing this thousands of times per response, choosing each word from everything it could possibly say next. Same structure as the umbrella. Same structure as the bail hearing. The stakes are different. The machinery is not.
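For readers who like to see the machinery, here is a minimal sketch of a single next-word step. The three candidate words and their raw scores are invented for illustration, and the "pick the top one" rule is the simplest possible choice; real systems work over far larger vocabularies and often sample rather than always taking the winner.

```python
import math

def softmax(raw_scores):
    """Turn raw scores into a probability distribution that sums to 1."""
    peak = max(raw_scores.values())
    exps = {word: math.exp(s - peak) for word, s in raw_scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical raw scores for the word that follows "Bring an ..."
raw_scores = {"umbrella": 2.0, "apple": 0.5, "idea": 0.1}

probs = softmax(raw_scores)
choice = max(probs, key=probs.get)  # greedy rule: commit to the top-ranked word

print(probs)
print(choice)
```

The distribution is the score; the rule that turns it into a single word is a separate design decision.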


What a Classifier Produces

Before we go further, we need to be precise about what a classifier actually produces, because most discussions of AI systems get this wrong, and the error matters.

A classifier does not simply decide. It first produces a score — a number that summarizes what the evidence supports. Your spam filter doesn’t just move an email to the junk folder; somewhere underneath that action, it assigned the email a score (say, an 87% probability of being spam) and then compared that score to a threshold to produce the decision. The score comes from the data and the model. The threshold comes from somewhere else entirely: from a judgment about which errors are more costly, made by whoever designed the system. That person or institution is the designer.
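The separation between score and threshold can be sketched in a few lines. Everything here is a toy: the word list, the scoring rule, and the two thresholds are assumptions chosen for illustration, not how any real spam filter works.

```python
def spam_score(email: str) -> float:
    """Toy score: share of words that look spammy. Stands in for a real model."""
    spammy = {"winner", "free", "prize", "urgent"}
    words = email.lower().split()
    return sum(w in spammy for w in words) / len(words) if words else 0.0

def decide(score: float, threshold: float) -> str:
    """The threshold, not the score, turns evidence into an action."""
    return "junk" if score >= threshold else "inbox"

email = "urgent free prize inside"
score = spam_score(email)            # 3 of 4 words are on the list: 0.75

print(decide(score, threshold=0.5))  # junk
print(decide(score, threshold=0.9))  # inbox
```

The same email, with the same score, lands in two different places. Nothing about the evidence changed; only the designer's line did.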

You interact with scores more directly than you probably realize. When you start typing a text message and your phone offers three suggested completions, those are the three options your phone’s model scored highest, given everything you’ve typed so far. When you ask an AI assistant for help with a complex task and it offers you a menu of approaches, those approaches are the top-scorers from whatever distribution the model is sampling. In each case, you see the winners. You don’t see the score that produced them, the threshold that determined how many winners to show, or the options that scored just below the cutoff and vanished without a trace.

That last point — the options that vanished — is one we will return to. An algorithm that always shows you the top-k options by score, and then learns from which of those options you select, is an algorithm that will never learn anything about the options it didn’t show. It cannot be audited on what it suppressed. The logic is uncomfortably close to something that has appeared in employment law: you cannot defend a hiring algorithm by saying no one from a particular group was hired, if the algorithm determined who got considered in the first place. A classifier that controls its own inputs controls what its own accuracy can be measured against.2
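A small simulation makes the point concrete. The option names, the starting estimates, the "true appeal" numbers, and the halfway update rule are all hypothetical; the sketch only assumes that the system updates its estimate for an option when, and only when, that option is shown.

```python
def run_top_k(estimates, true_appeal, k, rounds):
    """Show the k highest-estimated options each round; learn only about those."""
    estimates = dict(estimates)
    times_shown = {name: 0 for name in estimates}
    for _ in range(rounds):
        shown = sorted(estimates, key=estimates.get, reverse=True)[:k]
        for name in shown:
            times_shown[name] += 1
            # Feedback moves the estimate halfway toward the truth.
            estimates[name] += 0.5 * (true_appeal[name] - estimates[name])
    return estimates, times_shown

initial = {"A": 0.6, "B": 0.5, "C": 0.1}   # the system's starting beliefs
truth = {"A": 0.4, "B": 0.3, "C": 0.9}     # C is actually the best option

final, times_shown = run_top_k(initial, truth, k=2, rounds=20)
print(final)
print(times_shown)
```

Option C starts with a low estimate, is never shown, and so its estimate never moves, even though it is the best of the three. The system becomes very accurate about A and B and remains exactly as wrong about C as it was on day one.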

The distinction between score and threshold matters because the two can fail in different ways — and the failures have different causes and different cures. Call them the three failure modes. The first is a bad thermometer: a perfectly reasonable threshold applied to a biased or miscalibrated score. The line is drawn in the right place; the instrument feeding it is lying. The second is the wrong temperature: an accurate score compared to a badly chosen threshold. The instrument is fine; the line is in the wrong place. The third is the hardest to fix: a confident misdiagnosis. A well-intentioned designer who doesn’t understand the relationship between the score and the threshold will produce bad decisions while being certain she hasn’t. She may audit the score for bias and find none. She may defend the threshold as reasonable. But if she doesn’t understand how the two interact — how a threshold that looks neutral can operate very differently on a skewed distribution, or how a score that looks accurate can produce systematically harmful decisions at a particular threshold — she is practicing medicine without understanding what a fever actually means. We will see all three failure modes in what follows — including, I should warn you, in the footnotes, which in this post are doing real argumentative work rather than merely gesturing at sources. (Ed: “How many footnotes?” — Seven. “Seven.” — Seven.)


A Tour of Classifiers

Let’s look at classifiers in the wild, starting with the completely innocuous and ending somewhere more uncomfortable. The structure is the same in every case. What changes is the stakes, the identity of the designer, and who bears the cost of being classified wrong.

The umbrella. Score: your estimated probability of rain, assembled from whatever evidence you have available. Threshold: whatever level of rain-probability you personally find worth the inconvenience of carrying an umbrella. Designer: you. Stakes: mild inconvenience in either direction. This is the platonic form of a classification problem — low stakes, transparent structure, and you are simultaneously the designer, the classifier, and the person being classified. There is no conflict of interest, no power asymmetry, and no one to blame if you get it wrong except yourself.

The exam. Score: fraction of questions answered correctly. Threshold: whatever the professor or institution has decided counts as passing, proficient, or excellent. Designer: the professor, the department, the accrediting body, or some combination, depending on how deep you want to go. The score measures something real. But the threshold? The threshold is a design decision, and it is less natural than it appears. There is no deep reason why the line between an A and a B sits at 90% rather than at 89% or 93%. This became vivid in 2014, when the North Carolina State Board of Education voted unanimously to move its public high schools from a seven-point grading scale (an A required 93 or above) to a ten-point scale (90 suffices). The underlying scores didn’t change. The students didn’t get smarter. The threshold moved, because the Board decided it should. A student who earned a 91 one year had a B on her transcript; the same score the following year earned an A. Economists who studied the change found that GPAs rose immediately, while actual learning, as measured by standardized tests and ACT scores, did not. Lower-performing students responded to the looser threshold by missing more school.3 The score and the threshold came apart, and when they did, the threshold’s effect on behavior turned out to matter more than anyone had anticipated.
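The grading change is easy to restate as code. The two cutoffs for an A (93 and 90) come from the scales described above; the three student scores are made up.

```python
OLD_A_CUTOFF = 93  # seven-point scale
NEW_A_CUTOFF = 90  # ten-point scale

def earns_a(score: int, cutoff: int) -> bool:
    """Same score function under both regimes; only the cutoff differs."""
    return score >= cutoff

scores = [95, 91, 89]  # hypothetical students

under_old = [earns_a(s, OLD_A_CUTOFF) for s in scores]
under_new = [earns_a(s, NEW_A_CUTOFF) for s in scores]

print(under_old)  # [True, False, False]
print(under_new)  # [True, True, False]
```

The student with a 91 is the only one whose label changes, and nothing about her work changed to cause it.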

The spam filter. Score: probability that this email is spam, computed from the content, the sender, the metadata. Threshold: whatever probability the filter’s designers have decided warrants diversion to the junk folder. Designer: the engineers who built the filter, making a judgment about which error is worse. And here we have to think carefully, because the two kinds of errors don’t cost the same. A false positive — a real email diverted to spam — can cost you a missed appointment, a delayed diagnosis, a professional embarrassment. A false negative — a piece of spam that gets through — costs you two seconds of attention while you delete it. For most users, false positives are considerably more costly, so spam filter designers set their thresholds high: let some spam through rather than risk burying legitimate mail. This is the right call, but notice what it reveals: the threshold encodes a value judgment about whose errors matter more. It is not derived from the data. It is imposed on the data by the designer.
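One standard way to make "which error is worse" precise is an expected-cost rule, sketched below. This is a textbook decision rule rather than anything specific to a real spam filter, and the cost numbers are hypothetical. If an email is spam with probability p, diverting it costs (1 - p) times the cost of a false positive on average, and delivering it costs p times the cost of a false negative; diverting is the cheaper action only when p exceeds the ratio computed here.

```python
def cost_minimizing_threshold(cost_false_positive: float,
                              cost_false_negative: float) -> float:
    """Divert only when P(spam) exceeds c_fp / (c_fp + c_fn)."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Hypothetical costs: burying a real email is nine times worse than seeing spam.
cautious = cost_minimizing_threshold(cost_false_positive=9, cost_false_negative=1)

# If both errors cost the same, the line sits in the middle.
neutral = cost_minimizing_threshold(cost_false_positive=1, cost_false_negative=1)

print(cautious)  # 0.9
print(neutral)   # 0.5
```

The inputs to this function are costs, not data. Change whose inconvenience counts for more and the threshold moves, with the score untouched.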

The sales quota. Score: units moved this quarter, revenue generated, accounts closed. Threshold: the number your manager set in January, when the quarter seemed far away. Designer: your employer. Stakes: your job. The quota is a positive threshold in its purest binary form: you made it or you didn’t. Notice the structure of power that has entered. In the umbrella case, you were designer, classifier, and classified person simultaneously. In the exam case, the designer and the person being classified were in an educational relationship with some nominal alignment of interests. In the quota case, the designer and the person being classified have different and sometimes opposed interests: the company wants to maximize revenue; the salesperson wants to keep their job. The threshold is set to serve the designer’s interest, and the classified person has to navigate it.

What the salesperson navigates, the customer absorbs. Jansen, Nguyen, Pierce, and Snyder studied monthly sales bonus thresholds in the auto industry and found that loans closed at the end of the month — when salespeople are closest to their threshold and most motivated to close deals — default at meaningfully higher rates than loans closed earlier. The customers who took those loans were being classified by a separate system entirely (the loan application process), but the pressure created by the sales quota threshold was quietly shaping which customers got pushed through that system and how hard. The cost of the quota fell on the people who were never told it existed.4

The breathalyzer. Score: blood alcohol concentration, reported in grams of alcohol per deciliter of blood and estimated from a breath sample. Threshold: 0.08 in most U.S. jurisdictions, established by statute. Designer: the legislature. Stakes: a DUI charge, a suspended license, a criminal record. The breathalyzer is interesting because the threshold is not set by the person administering the test or the agency running the program; it is set by law, which means it was set by a political process, which means it reflects whatever coalition of interests was strong enough to put 0.08 into the statute rather than 0.07 or 0.10. The score is chemistry. The threshold is politics. If you are stopped at 0.079, you are sober in the eyes of the law; at 0.081, you are legally impaired. The line is not drawn by the data.5

The credit score. Score: a number summarizing your credit history — payment records, debt levels, length of accounts, inquiries. Threshold: whatever a lender has decided marks the boundary between creditworthy and not, for this loan, at this rate. Designer: the lender, working within regulatory constraints, drawing on a scoring model built by a credit bureau. Stakes: a mortgage, a car loan, an apartment, and in some states a job. Here is where the equity implications become impossible to ignore. Credit history is a proxy for creditworthiness. But from the 1930s through the 1960s, federal housing programs and private lenders systematically denied mortgages to residents of majority-Black neighborhoods — a practice called redlining. Families excluded from homeownership couldn’t build the intergenerational wealth that homeownership produces, and they couldn’t build the credit history that comes from holding a mortgage. Their children and grandchildren entered the credit system decades later with thinner files — not because of anything about their behavior, but because of the history of the financial products available to them. A credit scoring algorithm trained on this data doesn’t need to use race as a feature to produce racially disparate scores. It needs only to use credit history, which already carries the imprint of a system the algorithm neither created nor can see. The score looks objective. The threshold looks neutral. But the distribution the threshold is applied to was shaped by decades of exclusion, and a threshold set anywhere on that distribution will reflect it.6

Pretrial risk assessment. Score: an algorithmic risk score computed from criminal history, age, employment, residence, and other factors, intended to estimate the probability that a defendant will fail to appear for trial or be rearrested while awaiting it. Threshold: whatever the jurisdiction has set as the line between “detain” and “release,” which varies by jurisdiction, by judge, by the political moment. Designer: a combination of the algorithm’s developers, the jurisdiction’s policymakers, and the individual judge who applies the tool. Stakes: whether you sleep in a cell tonight or go home to your family. (Ed: “And ‘the designer’ here is everyone and no one, which is a different kind of problem.” Me: Yes. It is.) The person who scores just above the threshold is detained; the person who scores just below goes home. Both were scored by the same algorithm, applied the same threshold. Which side of that line you end up on depends on factors (your criminal history, your age, your zip code) that correlate with race in ways the algorithm does not acknowledge and the designer may not have intended. A false positive in this setting means a person who would not have missed trial or reoffended is sitting in jail, losing their job, losing their housing, losing their family stability, while their case moves through the system. A false negative means someone who would have reoffended is free. The designer set a threshold that determines how those errors are distributed. That threshold is not in the data.7


Know When to Hold ‘Em

Seven classifiers, from umbrella to jail cell. The structure is identical in every case: evidence comes in, a score comes out, a threshold converts the score into a decision, and the decision has consequences for the person classified. What varies is the stakes, the identity of the designer, and the degree to which the score and the threshold can be held apart and examined separately.

Now we can name what all of these have in common, which turns out to be something folk wisdom has understood for a long time. In every case, the classifier is doing the same thing: it is forming a belief — a probability assessment, based on available evidence, about the state of the world — and acting when that belief crosses a line. Statisticians call this belief your posterior, meaning your probability estimate after incorporating the evidence, as opposed to your prior, which is where you started before seeing it. The crucial feature of every example above is that the threshold runs in the direction of the evidence. Higher probability of rain: bring the umbrella. More evidence of spam: divert the email. Higher risk score: detain. Higher BAC: charge. In each case, the classifier says yes when the evidence is strong and no when it isn’t.
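Here is the umbrella decision written out as a prior, a posterior, and a positive threshold rule. The update is Bayes' rule; every number (the prior chance of rain, how often dark clouds appear on rainy and dry days, the personal threshold) is invented for illustration.

```python
def posterior(prior: float, p_evidence_if_true: float,
              p_evidence_if_false: float) -> float:
    """Bayes' rule: belief after seeing the evidence."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)

def positive_threshold_rule(belief: float, threshold: float) -> bool:
    """Say yes when the evidence is strong enough, and not before."""
    return belief >= threshold

prior_rain = 0.2
# Hypothetical: dark clouds show up on 90% of rainy days and 30% of dry ones.
belief_after_clouds = posterior(prior_rain, 0.9, 0.3)

my_threshold = 0.4
print(round(belief_after_clouds, 3))                               # 0.429
print(positive_threshold_rule(prior_rain, my_threshold))           # False
print(positive_threshold_rule(belief_after_clouds, my_threshold))  # True
```

Before looking out the window the belief sits below the line and the umbrella stays home; after the clouds it crosses the line. More evidence of rain can only push the decision toward yes, which is what it means for the threshold to run with the evidence.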

Make hay while the sun shines. Keep your powder dry until you’re close enough to be sure. Know when to hold ’em, know when to fold ’em. These are positive threshold rules in aphoristic form: observe, update your belief, act when the belief is strong enough and not before. The threshold runs with the evidence, not against it. This is so intuitive — it is, in some sense, what the word “rational” means in everyday usage — that it barely seems worth naming.

But it is worth naming. A rule with a name can be violated. And next time, I want to show you what happens when the training objective for an AI system produces a classifier that violates this rule — not by accident, not because of a bug, but as the optimal solution to the objective the designer actually specified. Maggie and I have a result that says, under the right conditions, accuracy-maximizing classifiers will act against the direction of the evidence: they will say yes when the evidence is weakest and no when it is strongest. We call this a negative threshold rule, and it is, we think, the right formal lens through which to understand why AI systems hallucinate — why they express the most confidence precisely when they are the least certain.

More on that next time. For now: know when to hold ’em. (We’ll talk next time about what happens when the algorithm gets the song backwards.)

With that, I leave you with this.


1 The foundational technical results are in our forthcoming AJPS article, “Classification Algorithms and Social Outcomes” — Maggie and I wrote a short non-technical summary for the AJPS blog. Related and ongoing work: Penn and Patty (arXiv:2511.08347), Penn (arXiv:2504.06127), Patty and Penn (arXiv:2505.18094), and Patty and Penn (arXiv:2312.03155). The Russell Sage Foundation has generously supported this work.

2 This is related to the exploration/exploitation tradeoff in the multi-armed bandit literature, and to the endogenous base rate problem in our own work — if a system only exploits its current best estimates it never learns whether those estimates are right, and cannot be audited for systematic bias in what it chose not to show. The employment law connection runs through Griggs v. Duke Power Co. (1971), which established disparate impact as a basis for discrimination claims: a facially neutral selection criterion that disproportionately excludes a protected group requires business justification. The algorithmic version is sharper: you cannot provide that justification for outcomes you never measured, on applicants your algorithm never surfaced.

3 Bowden, Rodriguez, and Weingarten, “The Unintended Consequences of Academic Leniency,” AEJ: Economic Policy. The paper exploits the NC grading scale change as a natural experiment and finds that the mechanical GPA increase was accompanied by significant increases in absenteeism — concentrated entirely among lower-ability students — and that these behavioral differences compounded over time, widening long-run achievement gaps as measured by ACT scores. The higher-ability students got the GPA boost; the lower-ability students, for whom the marginal incentive to show up had just been reduced, stopped showing up. This is, among other things, a clean illustration of the fact that a threshold is not a neutral line drawn on a neutral distribution — it shapes the behavior of the people on either side of it. The NC case also illustrates the “wrong temperature” failure mode: the Board changed the threshold without touching the score, and the behavioral consequences landed on exactly the students the change was meant to help.

4 Jansen, Nguyen, Pierce, and Snyder, “Product Sales Incentive Spillovers to the Lending Market: Evidence from Subprime Auto Loan Defaults,” Management Science 70(8), 2024. The paper shows that loans closed at the end of the month — when salespeople are closest to their bonus threshold — default at rates roughly 10% higher over 24 months than loans closed earlier, with the highest-payment-to-income customers seeing default rates rise from 13.6% to 19.7% on the last day of the month. Lenders who purchased the loans showed no evidence of being harmed by the default increase; the costs were borne entirely by consumers. Pierce is at Olin (WashU) and Snyder at Utah’s Eccles School — two of the sharpest researchers working on how incentive structures shape misconduct.

5 There is an apparent contradiction worth naming. The Fifth Amendment protects against compelled self-incrimination; a breathalyzer result incriminates you; so compelling you to produce one — by penalizing refusal — seems to violate the Amendment. Courts dissolved this tension not by arguing that breath tests don’t incriminate, but by classifying breath as physical evidence rather than testimonial evidence. The Fifth Amendment, per Schmerber v. California (1966), only reaches compelled testimony. Blowing into a tube produces a number your body generates; you haven’t asserted anything. The constitutional protection was defined so as to exclude the very thing it might seem to protect against. The Fourth Amendment question — can the state conduct this search without a warrant? — was resolved in Birchfield v. North Dakota (2016), where the Court held that breath tests pass as searches incident to arrest. Since there is no constitutional right to refuse a valid search, criminalizing refusal doesn’t criminalize a constitutional right. The circle closes. Notice what closed it: a series of classification decisions made by courts, each one a threshold set to preserve the state’s ability to compel the score. The classified person’s ability to opt out was eliminated through careful categorization, not force. We will see this move again.

6 A credit score is more specific than it appears and more general than it seems. The specific part: what FICO’s core models actually predict is the probability that you will be 90 or more days past due on any credit obligation within the next 24 months. Not “are you creditworthy” in some general sense — a precise behavioral prediction with a precise horizon. The general part: this 90-day delinquency signal turns out to be predictive of a remarkable range of seemingly unrelated outcomes. Insurers use credit-based scores to predict insurance claims, and the correlation holds up empirically — FICO’s own documentation notes that people with lower credit scores file more claims on average, while acknowledging that no causal relationship has been established and that causal relationships are not, in any case, what insurance underwriting requires. Researchers have found that credit scores predict cardiovascular disease risk, and that the likely mechanism is that scores capture accumulated human capital — educational attainment, cognitive ability, self-control — established in childhood and expressed across many domains of adult behavior. A life insurance industry veteran, on learning that a researcher studied self-control and life outcomes, reportedly said: “We do that too, but we use credit scores.” Meanwhile, contrary to the “one score” myth, dozens of FICO models circulate simultaneously — different versions for mortgages, auto loans, credit cards, and general use, often drawing on different bureau data, so that your score from Equifax may differ meaningfully from your score from TransUnion. The three bureaus (Equifax, Experian, TransUnion) are heavily regulated — by the Fair Credit Reporting Act, the Equal Credit Opportunity Act, and CFPB oversight — but notably without any regulatory specification of what probability of default a score of 700 should correspond to, or what the score should be understood to be measuring beyond its official 90-day prediction target. 
The score is a proxy for something real that nobody has fully named, regulated for purposes that don’t require naming it, and used for applications — employment, housing, insurance, health risk — that go well beyond the thing it was designed to predict. This is the “confident misdiagnosis” failure mode applied to an entire industry.

7 The canonical public debate about algorithmic risk assessment in criminal justice involves a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), deployed in Broward County, Florida and extensively analyzed in a 2016 ProPublica investigation. The dispute between ProPublica and the algorithm’s developer, Northpointe, looked superficially like a dispute about whether COMPAS was racially biased. It was actually a dispute about the threshold — specifically, about which definition of fairness the threshold should be set to optimize. ProPublica found that Black defendants were nearly twice as likely as white defendants to be falsely labeled high-risk (a false positive disparity); Northpointe responded that the algorithm’s predictions were equally accurate for both groups (calibration parity). Both claims were correct. They were measuring different things.

The deeper point, established mathematically by Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016), and elaborated in our own work, is that when two groups have different base rates of the outcome being predicted (here, recidivism), it is mathematically impossible for an imperfect classifier to simultaneously equalize error rates (false positives and false negatives) across groups and maintain equal predictive accuracy across groups. You can satisfy one; you cannot satisfy both. Adjusting the threshold to fix the false positive disparity breaks the calibration; restoring the calibration breaks the false positive parity. The impossibility doesn’t disappear; it relocates. The threshold determines who bears the cost of which errors, and no threshold can make those costs equal in every sense when the underlying distributions differ.
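For readers who want the arithmetic behind the impossibility, the relationship can be written as a single identity, in the form usually credited to Chouldechova (2017). The symbols are our shorthand for this note: p is a group's base rate, PPV is the share of people labeled high-risk who actually go on to reoffend, and FPR and FNR are the false positive and false negative rates within that group.

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
```

Hold PPV and FNR equal across two groups and the right-hand side still differs whenever p differs, so the false positive rates must differ too. The identity follows from counting: true positives are a fraction p(1 - FNR) of the group, false positives are a fraction (1 - p)FPR, and PPV is the ratio of true positives to all positives.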

Maggie and I have worked on this problem extensively, examining both the structure of optimal classification under various fairness criteria and how those criteria interact with individuals’ behavioral responses to being classified. We show, among other things, that the optimal algorithm under a single uniform threshold — one that treats all defendants identically regardless of group — maximizes public safety while satisfying a meaningful form of equality, but that this optimum generally differs from what fairness-constrained algorithms produce, creating genuine and unavoidable tension between public safety and prevailing notions of algorithmic fairness. See Patty and Penn, “Algorithmic Fairness and Statistical Discrimination,” Philosophy Compass 18(1), 2022; “Algorithmic Fairness with Feedback,” arXiv:2312.03155; and the AJPS paper cited in footnote 1.