All Measurements Are Local

Scientific American ran a piece yesterday on Simpson’s paradox — the phenomenon in which a trend that holds within every subgroup of a dataset reverses, or vanishes entirely, when those subgroups are combined. Regular readers of this blog will have encountered it before. In 2012, it showed up inside No Child Left Behind: a school could be improving in every demographic subgroup it served and still fail the federal standard for Adequate Yearly Progress, because the standard aggregated subgroup results in a way that obscured the underlying trend. Earlier this month, it appeared twice in quick succession: first in All Statistics Are Local, where national incarceration rates were simultaneously falling and rising depending on whether you looked at the aggregate or at individual counties; and then in Your Basket May Vary, where the Consumer Price Index was accurately measuring inflation for a constructed average consumer who doesn’t exist while misrepresenting it for nearly everyone who does.

Image credit: Wikimedia Commons, CC BY-SA 4.0

The Scientific American piece covers the Berkeley admissions case from the 1970s, a COVID mortality comparison between Italy and China, and a drug trial where a medication outperforms placebo overall but underperforms it for both men and women separately. It ends by noting that there is “no universal answer” to the question of which result to trust, and recommends further research. That’s correct, as far as it goes. But I want to explain why there’s no universal answer — because the reason is a theorem, not a limitation of available data.


The Riddle

Here is the drug trial example, made precise. You have two groups of patients — call them Group A and Group B — and you give some patients in each group the drug and some the placebo. Here are the recovery rates:

Group       Drug              Placebo
Group A     20/40 = 50%       15/40 = 37.5%
Group B     70/80 = 87.5%     45/60 = 75%
Combined    90/120 = 75%      60/100 = 60%

Wait. The drug beats placebo in Group A (50% vs. 37.5%), and it beats placebo in Group B (87.5% vs. 75%). And it beats placebo overall (75% vs. 60%). No paradox here — everything points the same direction. Let me adjust the numbers.

Group       Drug              Placebo
Group A     20/40 = 50%       30/50 = 60%
Group B     70/90 ≈ 78%       3/10 = 30%
Combined    90/130 ≈ 69%      33/60 = 55%

Now the drug loses to placebo in Group A (50% vs. 60%). It wins in Group B (78% vs. 30%). And it wins overall (69% vs. 55%). So: should you approve the drug? The subgroup results disagree with each other. The aggregate result agrees with one of them and disagrees with the other. Which number is telling you the truth?
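For readers who want to check the arithmetic, here is a minimal Python sketch that recomputes the second table from its raw counts. The dictionary layout and the helper names `rate` and `pooled` are mine, chosen for illustration; only the counts come from the table.

```python
from fractions import Fraction

# (recovered, treated) counts copied from the second table.
counts = {
    "Group A": {"drug": (20, 40), "placebo": (30, 50)},
    "Group B": {"drug": (70, 90), "placebo": (3, 10)},
}


def rate(recovered, total):
    """Recovery rate as an exact fraction."""
    return Fraction(recovered, total)


def pooled(arm):
    """Pool one arm across groups by summing recoveries and patients."""
    recovered = sum(counts[group][arm][0] for group in counts)
    total = sum(counts[group][arm][1] for group in counts)
    return rate(recovered, total)


for group, arms in counts.items():
    drug = rate(*arms["drug"])
    placebo = rate(*arms["placebo"])
    print(f"{group}: drug {float(drug):.1%} vs placebo {float(placebo):.1%}")

print(
    f"Combined: drug {float(pooled('drug')):.1%} "
    f"vs placebo {float(pooled('placebo')):.1%}"
)
```

Run as written, it should report 50.0% vs 60.0% for Group A, 77.8% vs 30.0% for Group B, and 69.2% vs 55.0% combined, matching the table.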

The honest answer is: all of them are. Each number accurately summarizes the data it was computed from. The paradox arises not because anyone made an arithmetic error but because the two arms are spread unevenly across the groups: 90 of the 130 drug patients are in Group B, where the drug performs better, while 50 of the 60 placebo patients are in Group A, where placebo performs better. The aggregate is real; the subgroup results are real; they are contradictory; and the contradiction is not resolvable by collecting more data of the same kind.

What the contradiction reveals is that there is a choice embedded in the analysis — a choice about how much weight to give each subgroup. Weight by patients, and you get the aggregate. Weight by groups equally, and you average the subgroup results. Weight by something else — by pre-treatment severity, by age, by income — and you get yet another answer. Each weighting scheme is a legitimate statistical procedure. Each produces a different conclusion.
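To make the role of the weights concrete, here is a small sketch that takes the two subgroup gaps (drug minus placebo) from the second table and combines them under different weights. The function name `weighted_effect` and the 90 percent scenario are hypothetical, invented for illustration. One caveat: the pooled figures in the table are not a single shared weighting of the two gaps, because the drug and placebo arms have different group mixes. The sketch deliberately applies one set of weights to both arms so that the weighting choice is the only thing that varies.

```python
# Subgroup recovery rates from the second table: (drug, placebo).
rates = {"A": (20 / 40, 30 / 50), "B": (70 / 90, 3 / 10)}


def weighted_effect(weight_a):
    """Drug-minus-placebo gap when Group A gets weight_a and Group B the rest."""
    weight_b = 1.0 - weight_a
    gap_a = rates["A"][0] - rates["A"][1]  # -0.10: drug loses in Group A
    gap_b = rates["B"][0] - rates["B"][1]  # about +0.48: drug wins in Group B
    return weight_a * gap_a + weight_b * gap_b


scenarios = [
    ("equal weight per group", 0.5),
    ("weight by group size (90 vs 100 patients)", 90 / 190),
    ("hypothetical: Group A is 90% of the target population", 0.9),
]

for label, weight_a in scenarios:
    effect = weighted_effect(weight_a)
    verdict = "drug wins" if effect > 0 else "drug loses"
    print(f"{label}: gap {effect:+.3f} -> {verdict}")
```

The first two schemes favor the drug and the third does not. Nothing about the data changed between them; only the answer to "whose outcomes count for how much" did.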

That choice is not a statistical question. It is a normative one.


Enter Arrow

In 1951, Kenneth Arrow proved what he called the General Possibility Theorem, known more commonly, and more candidly, as Arrow’s Impossibility Theorem. The theorem is usually introduced in the context of voting, which is understandable but unfortunate, because the voting framing makes it easy to file the result under “interesting fact about elections” and move on. The real subject of Arrow’s theorem is measurement: specifically, the problem of coherently summarizing multiple signals into a single answer. Elections are one application. Drug trials are another. Inflation indices are another. Network analysis is another. Arrow’s result applies to all of them for the same reason, and understanding why requires leaving the voting booth behind entirely. [1]

The axiomatic method — the approach Arrow used, and which Maggie Penn and I have been developing in various forms since our book Social Choice and Legitimacy (Cambridge University Press, 2014) — works like this. You want to measure something. Rather than jumping straight to a formula, you ask: what properties would any reasonable measurement procedure have to satisfy? You write those properties down as axioms. Then you ask: is there a procedure that satisfies all of them simultaneously? And if not: which ones are mutually incompatible?

Here is a concrete, non-electoral illustration of what this looks like in practice. Suppose you want to measure how “central” or “important” a given node is in a network — say, which members of Congress are the most influential based on their co-sponsorship ties, or which proteins in a biological network are the most connected. The obvious measure is degree centrality: just count how many connections each node has. Seems innocent enough. But before adopting it, the axiomatic approach asks: what properties does degree centrality satisfy, and are those the properties we actually want?

It turns out that degree centrality is the unique measure satisfying three specific axioms: anonymity (nodes with the same connections get the same score), positive responsiveness (adding a connection increases your centrality), and independence of non-dominated arcs (your centrality relative to mine shouldn’t depend on connections that neither of us has). Change any one of those axioms (decide, say, that you care not just about how many connections a node has but about the quality of those connections) and degree centrality is no longer the right measure. A different set of axioms produces closeness centrality, or betweenness centrality, or eigenvector centrality. Each is internally coherent. Each answers a slightly different question. And the question of which one is “correct” is not a statistical question; it’s a question about what you’re trying to measure, which is a normative question. [2]
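A toy network shows how two of these measures can disagree about who is "most central." The graph below is entirely made up: two hubs, each with three private contacts, joined through a single go-between node. The sketch uses the standard textbook definitions (degree as a count of neighbors, closeness as (n - 1) divided by the sum of shortest-path distances) and plain Python rather than any particular network library.

```python
from collections import deque

# Hypothetical undirected network: two hubs linked only through "bridge".
edges = [
    ("hub1", "a"), ("hub1", "b"), ("hub1", "c"), ("hub1", "bridge"),
    ("bridge", "hub2"),
    ("hub2", "d"), ("hub2", "e"), ("hub2", "f"),
]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)


def degree(node):
    """Degree centrality: the number of direct connections."""
    return len(neighbors[node])


def closeness(node):
    """Closeness centrality: (n - 1) / total shortest-path distance (via BFS)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return (len(neighbors) - 1) / sum(dist.values())


for node in ("hub1", "bridge"):
    print(f"{node}: degree {degree(node)}, closeness {closeness(node):.3f}")
```

Here the hub has twice the bridge's degree (4 vs 2), but the bridge has the higher closeness score (8/14 vs 8/15) because it sits a shorter total distance from everyone else. Which node is "more important" depends on which axioms you signed up for.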

This is the axiomatic method applied to data analysis, and it is exactly the method Arrow applied to aggregation. But before listing Arrow’s conditions on the aggregation procedure, it is worth pausing on the condition he imposed on the aggregation’s output — because this is the condition that makes the theorem so much more general than its electoral framing suggests.

Arrow required that the output be a transitive ordering. Transitivity means: if the aggregate says X beats Y, and Y beats Z, then the aggregate must say X beats Z. This sounds obvious — of course a ranking should be transitive. But the whole point is that it is not obvious at all that aggregation procedures will produce it. The canonical illustration is the Condorcet paradox: with three candidates A, B, C and three voter groups, majority voting can produce A beats B, B beats C, and C beats A simultaneously. The majority “prefers” A to B, B to C, and C to A. That is not a ranking. It is a cycle, and a cycle is not a measurement. You cannot use it to make a decision, generate a scale, or compare alternatives in any stable way.
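The cycle is easy to reproduce. The sketch below uses the classic three-ballot profile, with each ballot standing in for one equally sized voter group; the function name `majority_prefers` is my own.

```python
# Three equally sized voter groups, each with a transitive ranking (best first).
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]


def majority_prefers(x, y):
    """True if more ballots rank x above y than rank y above x."""
    x_over_y = sum(1 for ballot in ballots if ballot.index(x) < ballot.index(y))
    y_over_x = len(ballots) - x_over_y
    return x_over_y > y_over_x


for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} to {y}: {majority_prefers(x, y)}")
```

All three lines come back True. Every individual ballot is a perfectly good ranking, and the pairwise majority result still is not one.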

Arrow’s project was precisely to ask: what conditions on an aggregation procedure would guarantee a transitive output? His answer — which is the theorem — is that no procedure satisfying the four conditions below can guarantee transitivity in general. The four conditions are what you’d want from any reasonable aggregation. Transitivity is what you need to have a measurement at all. The theorem says you can’t always have both.

This is why Arrow’s result applies to drug trials, inflation indices, and network centrality measures just as much as to elections. Whenever you are trying to produce a coherent, transitive summary of multiple signals — a ranking, a score, an index — you are in Arrow’s framework. The conditions on the procedure (below) are conditions any reasonable aggregation should satisfy. Transitivity is the condition any reasonable measurement must satisfy. Arrow showed these are mutually incompatible in general. That is what makes the theorem about measurement, not just about voting.

Arrow’s four conditions on the aggregation procedure are:

Unrestricted Domain. The procedure should work for any possible configuration of inputs — not just well-behaved ones. A procedure that requires cooperative data before it produces a coherent answer is not a procedure. It’s a hope.

Pareto. If every input agrees that X beats Y, the output should say X beats Y. This is close to the weakest possible condition of responsiveness: if there is unanimous agreement, the procedure should agree too.

Independence of Irrelevant Alternatives. Whether the output ranks X above Y should depend only on how the inputs rank X against Y directly — not on where some third option Z happens to fall, and not on any feature of the situation other than the direct X-versus-Y comparisons. In the drug trial context: whether the aggregate verdict on the drug comes out positive or negative should depend only on what each subgroup says about the drug. Not on how large the subgroups happen to be.

Non-Dictatorship. No single input should automatically determine the output regardless of all the others.

Arrow’s result: no procedure satisfies all four simultaneously while always producing a transitive ordering. For any aggregation of two or more inputs across three or more options, something has to give.

Now return to the drug trial. The inputs are the two subgroup verdicts: Group A says the drug loses; Group B says it wins. We need an aggregation procedure.

Option 1: weight by number of patients. Group B is larger, so the aggregate says the drug wins. This satisfies Pareto — if both groups had agreed, the aggregate would agree too. But it violates Independence of Irrelevant Alternatives. The aggregate verdict on drug-versus-placebo now depends on how many patients ended up in each group, which is not a drug-versus-placebo comparison. Change the enrollment numbers without changing anyone’s recovery rates, and the verdict can flip. Group sizes are, in Arrow’s language, irrelevant to whether the drug works — but they are determining the answer.
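Here is a sketch of that flip. The recovery rates are held fixed at the values in the second table; only the enrollment changes. The "shifted" enrollment figures are hypothetical, chosen to put most drug patients in Group A, and the names `pooled_rate` and `verdict` are mine.

```python
# Recovery rates from the second table, held fixed throughout.
RATES = {
    "drug": {"A": 20 / 40, "B": 70 / 90},
    "placebo": {"A": 30 / 50, "B": 3 / 10},
}


def pooled_rate(arm, enrollment):
    """Pooled recovery rate for one arm, given patients enrolled per group."""
    recovered = sum(RATES[arm][group] * n for group, n in enrollment.items())
    return recovered / sum(enrollment.values())


def verdict(drug_enrollment, placebo_enrollment):
    """Compare pooled drug and placebo rates under a given enrollment."""
    drug = pooled_rate("drug", drug_enrollment)
    placebo = pooled_rate("placebo", placebo_enrollment)
    return "drug wins" if drug > placebo else "drug loses"


# Enrollment as in the table: drug patients concentrated in Group B.
original = verdict({"A": 40, "B": 90}, {"A": 50, "B": 10})

# Hypothetical enrollment: same rates, but drug patients concentrated in Group A.
shifted = verdict({"A": 180, "B": 9}, {"A": 50, "B": 10})

print("original enrollment:", original)
print("shifted enrollment: ", shifted)
```

With the table's enrollment the pooled drug rate is about 69% against 55% for placebo. With the shifted enrollment it drops to about 51%, below placebo, even though no patient in either group responds any differently.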

Option 2: weight the two groups equally, regardless of size. This respects Independence of Irrelevant Alternatives — the aggregate depends only on what each group says about the drug, not on how large each group is. But now we have a tie: one group says yes, one says no, and we need a tiebreaker. Any tiebreaker either designates one group as the dictator (violating Non-Dictatorship) or introduces some additional criterion external to the direct comparison (violating Independence of Irrelevant Alternatives). We are back where we started.

Option 3: weight by some substantive criterion — pre-treatment severity, age, income, the variable you believe explains why the groups differ. This can produce a coherent answer, but it requires choosing the criterion, and different criteria produce different verdicts. Each choice is defensible. None is uniquely correct. You have satisfied Unrestricted Domain and Non-Dictatorship, but you have reintroduced dependence on something external to the direct drug-versus-placebo comparisons — Independence of Irrelevant Alternatives falls again.

Every path violates at least one property you’d want an aggregation procedure to satisfy. The impossibility is not a failure of data quality or study design. It is a structural feature of what it means to aggregate — in drug trials, in network analysis, in inflation measurement, and yes, in elections too.

This is why the Scientific American piece is right that there is “no universal answer” — but the reason runs deeper than statistical complexity. It is not waiting to be resolved by a better study design or a larger sample. It is a theorem.


Which brings us back to the CPI

The previous post in this series, Your Basket May Vary, argued that the Consumer Price Index is one weighting scheme among many — it weights price changes by the expenditure patterns of a constructed “average urban consumer,” which systematically underweights the inflation experience of low-income households and overweights categories dominated by higher-income spending. The post stopped short of explaining why the BLS cannot simply fix this by using a better weighting scheme.

Here is why: there is no weighting scheme that correctly represents everyone’s inflation experience simultaneously. If the BLS weights by expenditure share, households with below-average expenditure on a category are underrepresented in that category’s contribution to measured inflation. If the BLS weights equally across households, it gives equal weight to a household spending $200 a month on food and one spending $2,000 a month on food, which accurately represents neither. If it produces separate price indices for different populations (as it does, quietly, through the CPI-W for urban wage earners and clerical workers and the R-CPI-E for older Americans), it must then face the question of how to aggregate those indices into a headline number that policymakers and journalists actually use. And at that point the problem recurses: we are back to Simpson’s paradox, one level up.
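A two-household toy economy shows how far apart the first two schemes can land. Every number below is invented for illustration and none of it is BLS data or BLS methodology: a low-income household that spends 40% of its budget on food, a high-income household that spends 10%, and a year in which food prices rise 10% while everything else rises 2%.

```python
# Hypothetical monthly spending in dollars; not BLS data.
spending = {
    "low_income": {"food": 200, "other": 300},
    "high_income": {"food": 2000, "other": 18000},
}
# Hypothetical one-year price changes by category.
price_change = {"food": 0.10, "other": 0.02}


def household_inflation(basket):
    """Inflation experienced by one household, weighted by its own budget shares."""
    total = sum(basket.values())
    return sum(amount / total * price_change[cat] for cat, amount in basket.items())


def expenditure_weighted():
    """Headline rate when every dollar spent counts equally."""
    total = sum(sum(basket.values()) for basket in spending.values())
    weighted = sum(
        basket[cat] * price_change[cat]
        for basket in spending.values()
        for cat in basket
    )
    return weighted / total


def household_weighted():
    """Headline rate when every household counts equally."""
    rates = [household_inflation(basket) for basket in spending.values()]
    return sum(rates) / len(rates)


for name, basket in spending.items():
    print(f"{name}: {household_inflation(basket):.2%}")
print(f"expenditure-weighted headline: {expenditure_weighted():.2%}")
print(f"household-weighted headline:   {household_weighted():.2%}")
```

In this made-up economy the low-income household experiences 5.20% inflation and the high-income household 2.80%. Counting dollars equally gives a headline of about 2.86%; counting households equally gives 4.00%. Both are correct computations, and neither matches the low-income household's experience.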

The political valence of this is not subtle. When the administration says inflation is under control and households say it isn’t, they are not necessarily lying or innumerate. They may be accurately reporting what different legitimate aggregation schemes produce. Arrow’s theorem doesn’t tell us who is right. It tells us the question “what is the inflation rate?” has no uniquely correct answer — only answers that are correct relative to a weighting scheme, and weighting schemes that are choices, not facts.

The measurement is not broken. The measurement is doing exactly what it was designed to do. The design is a normative choice. And the normative choice is — in the formal sense Arrow gave us in 1951 — irreducibly political.


With that, I leave you with this.


[1] This is the argument Maggie Penn and I develop at length in Social Choice and Legitimacy: The Possibilities of Impossibility (Cambridge University Press, 2014): that the classic impossibility results of Arrow, Gibbard, and Satterthwaite are not indictments of democratic governance but explanations of why democratic institutions are structured the way they are. Impossibility theorems don’t tell us that legitimate governance is impossible; they tell us that pure aggregation alone cannot be enough. You also need procedures, and reasons, and accountability, which is what institutions are for.

[2] The axiomatic characterization of degree centrality is due to van den Brink and Gilles (2003). Maggie Penn and I discuss this result, along with axiomatic approaches to clustering algorithms and matching procedures, in “Analyzing Big Data: Social Choice and Measurement” (PS: Political Science & Politics, 2015). The broader argument, that the axiomatic method applies not just to voting but to the full range of measurement and data analysis problems social scientists actually face, is developed further in our 2019 Annual Review of Political Science piece. The point is always the same: before asking “which formula should I use?”, ask “what properties do I want the formula to satisfy?” The answer to the second question usually determines the answer to the first, and sometimes reveals that no formula satisfies all the properties you wanted.
