What the Dashboard Didn’t Show You (Or, “The Denominator Moved”)

Roosevelt Elementary started Year 1 of Elevate with 100 students. It ended the year with 400. The other two schools held roughly steady. The district grew from 600 students to 900, and the composition of that denominator shifted decisively toward the lowest-scoring school. That one fact explains the entire dashboard.

Elevate worked. Every school’s average SAPA score went up three points. Jefferson, whose students started at 85, finished the year at 88. Lincoln went from 70 to 73. Roosevelt, the lowest-scoring school, went from 55 to 58. This is not a subtle effect or a statistical shimmer. A three-point within-school gain is large for a single year of a new curriculum, and it is uniform across schools whose baseline populations could hardly be more different. By any measure focused on what the curriculum does to the students it reaches, the curriculum is working.

The district composite fell anyway. It fell because the composite is a weighted average over current enrollment, and current enrollment is not what it was when the baseline was measured. Roosevelt’s share of the district went from one-sixth to nearly half. The same three schools were being averaged at the end of the year as at the beginning, but the enrollment weight the lowest-scoring school contributed to the composite quadrupled. A weighted average of three improving averages can fall when the weights shift toward the lowest of the three.1 That is what happened.
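If you want to see the mechanism in code rather than prose, here is a minimal sketch using the enrollment and score figures reported in this post (the same ones the footnote checks by hand). The variable and function names are mine, not the dashboard’s.

```python
# Baseline (Year 0) and Year 1 figures for the three Northbrook schools,
# as reported in this post: (mean SAPA score, enrollment).
year0 = {"Jefferson": (85, 200), "Lincoln": (70, 300), "Roosevelt": (55, 100)}
year1 = {"Jefferson": (88, 200), "Lincoln": (73, 300), "Roosevelt": (58, 400)}

def composite(schools):
    """Enrollment-weighted district average: sum(score * n) / sum(n)."""
    total = sum(n for _, n in schools.values())
    return sum(score * n for score, n in schools.values()) / total

print(f"Baseline composite: {composite(year0):.1f}")  # 72.5
print(f"Year 1 composite:   {composite(year1):.1f}")  # 69.7

for school in year0:
    gain = year1[school][0] - year0[school][0]
    print(f"{school}: +{gain} points within school")   # +3 at every school
```

Every within-school number goes up; the district number goes down. Nothing in the arithmetic is broken. The weights moved.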


This is Simpson’s paradox with a specific mechanism named. In “All Measurements Are Local,” I argued that no aggregation procedure can settle which number to trust when subgroups and aggregate disagree — the question is normative, not statistical, and Arrow guarantees that whatever rule you pick will violate a property some reasonable person considers indispensable. That post was about which number is correct. This one is about something different, and in some ways harder: why the numbers disagree in this particular way, and what the disagreement tells you about the intervention that produced it.

The disagreement in Northbrook is not statistical noise. It is the signature of a specific class of intervention: one that expands access at a site whose baseline was below the district average. Any curriculum, program, algorithm, or classification rule that changes both outcomes and the composition of who is exposed to it will produce this pattern whenever the new entrants skew toward the low end of the existing distribution. That is not a failure of the intervention. For an intervention whose stated goal is equitable access, it is a success condition. The composite, however, is not equipped to describe it as one.


The technical name for what Elevate did to Northbrook — not to student scores, but to the distribution of students across the district — is endogenous composition. The composition of the group you are measuring is not a fixed feature of the world. It is partly produced by the intervention you are measuring. Maggie and I have been developing this point, in the setting of algorithmic classification, for some years now; our recent AJPS paper makes the case formally. The short version is this: classification systems do not merely measure the populations they are applied to. They reshape them. The population you observe after the system has been in place is not the population you would have observed in its absence — and the difference is not noise. It is the system working.

Northbrook is an educational version of that dynamic. Elevate did two things simultaneously: it improved outcomes at each school, and it changed who was enrolled at each school. The composite was designed to measure the first of those things, and it absorbed the second without flagging that it had done so. To a reader looking only at the composite, the curriculum looks like a failure. To a reader looking only at the school-level gains, the curriculum looks like a uniform success. Both readers are accurately reporting what they see. Neither has enough information to tell you what happened.


The information that would have told you what happened was in the dashboard’s sidebar, under the heading “District demographics,” in small italic type marked “Baseline year figures.” The sidebar never updated. When the school cards appeared on the drill-down, they showed Year 1 SAPA against baseline SAPA; they did not show Year 1 enrollment against baseline enrollment. The dashboard was, in this sense, not hiding anything. It was reporting exactly what public-facing dashboards routinely report, which is outcome data conditioned on a composition that is treated as if it were stable. The composition wasn’t stable. No public report of Elevate’s first year would have made you suspect otherwise unless you went looking.

This is the piece that algorithmic audits collect and that public accountability reports typically omit. An audit of a classification system that has changed access asks, as a first question, who got exposed to the system in Year 1 compared to Year 0, and whether the change in exposure is correlated with the change in outcomes. The audit’s working assumption is that composition is a treatment, not a fixed feature of the population. Public reports, by contrast, typically treat composition as background. They report the aggregate and note, perhaps, that enrollment grew. They rarely model the aggregate as a function of enrollment composition, because doing so would require the reporter to claim that the intervention changed who was being counted — a claim whose accounting implications, especially around accountability ratings tied to aggregate performance, are considerable.
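To make “composition is a treatment” concrete: one standard way to model the aggregate as a function of enrollment composition is to split the composite’s change into a within-school term (score gains, evaluated at Year 1 weights) and a composition term (weight shifts, evaluated at baseline scores). The sketch below uses the post’s figures; the particular weight/score pairing is one decomposition convention among several, and the names are mine, not an auditor’s.

```python
# Decompose the change in the district composite into a within-school
# (outcome) term and a composition (enrollment-shift) term.
year0 = {"Jefferson": (85, 200), "Lincoln": (70, 300), "Roosevelt": (55, 100)}
year1 = {"Jefferson": (88, 200), "Lincoln": (73, 300), "Roosevelt": (58, 400)}

n0 = sum(n for _, n in year0.values())  # 600
n1 = sum(n for _, n in year1.values())  # 900

# Score gains, weighted by Year 1 enrollment shares.
within = sum((year1[s][0] - year0[s][0]) * year1[s][1] / n1 for s in year0)
# Enrollment-share shifts, weighted by baseline scores.
composition = sum(year0[s][0] * (year1[s][1] / n1 - year0[s][1] / n0) for s in year0)

print(f"Within-school effect: {within:+.2f}")              # +3.00
print(f"Composition effect:   {composition:+.2f}")         # -5.83
print(f"Net composite change: {within + composition:+.2f}") # -2.83
```

Written this way, the composite’s decline stops being a verdict on the curriculum and becomes two numbers with opposite signs: what the intervention did to scores, and what it did to who was being scored.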


One thing worth saying about the poll itself, though not about how anyone voted. The three options corresponded not to three numbers but to three different implicit drawers, in the sense the current series has been developing. Option A — the composite — treats the district as a single accountable unit and asks whether that unit is improving overall. Option B — the within-school gain — treats each school as its own unit and asks whether the intervention is working within it. Option C — the school gap — treats inter-school inequality as the object of interest. These are three different classifications of what Elevate’s first year is. The dashboard did not ask which one to use. It presented all three and made you choose.

Choosing one of the three without the enrollment data is like grabbing a key from the junk drawer without knowing what it opens. Each metric answers a different question, and each question has a legitimate constituency — boards care about composites for accountability ratings, principals care about within-school gains for instructional reasons, equity advocates care about gaps. When the underlying composition shifts, the answers to the three questions don’t just differ in degree. They can point in different directions. There is no aggregate metric, and no voting procedure over the three, that resolves this cleanly. The impossibility is structural, and it is the same structural impossibility Arrow’s theorem describes in voting: no rule satisfies all the properties any reasonable person would want a rule to satisfy.
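For the numerically inclined, the three drawers give three answers you can compute in a few lines from the same figures. The labels below are mine, not the poll’s, and “gap” here means the spread between the highest- and lowest-scoring schools.

```python
# The three poll options, computed from the figures reported in this post.
year0 = {"Jefferson": (85, 200), "Lincoln": (70, 300), "Roosevelt": (55, 100)}
year1 = {"Jefferson": (88, 200), "Lincoln": (73, 300), "Roosevelt": (58, 400)}

def composite(schools):
    return sum(s * n for s, n in schools.values()) / sum(n for _, n in schools.values())

def gap(schools):
    scores = [s for s, _ in schools.values()]
    return max(scores) - min(scores)

# Option A: the district as one accountable unit.
print(f"A. Composite: {composite(year0):.1f} -> {composite(year1):.1f}")  # 72.5 -> 69.7
# Option B: each school as its own unit.
print("B. Within-school gains:", {s: year1[s][0] - year0[s][0] for s in year0})  # +3 everywhere
# Option C: inter-school inequality.
print(f"C. Top-to-bottom gap: {gap(year0)} -> {gap(year1)}")              # 30 -> 30
```

One answer falls, one rises, one does not move. All three are computed from the same four rows of data.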


What the dashboard didn’t show you was that the intervention it was measuring had changed what the dashboard was a dashboard of. The composite wasn’t lying. It was accurately reporting a weighted average over a denominator that Elevate had, as part of its intended operation, redrawn. Roosevelt’s enrollment quadrupled because Elevate was designed to expand access at Roosevelt. The district composite fell because Roosevelt’s new weight in the district pulled the average toward the lowest-scoring school, whose students had improved by the same three points as everyone else but started thirty points below the highest-scoring school. Every number was correct. None of them, alone, was sufficient.

The conservation of impossibility holds. You do not get to resolve Simpson’s paradox by picking the right number. You get to decide which question you are answering, and to acknowledge — in the report, not in a footnote — that the intervention being evaluated changed the composition of the group being measured. That acknowledgment does not eliminate the ambiguity. It makes the ambiguity honest.

With that, I leave you with this.


1 For readers who want to verify: the baseline composite was (85×200 + 70×300 + 55×100)/600 = 72.5. The Year 1 composite, with Jefferson and Lincoln enrollments roughly stable and Roosevelt at 400, is (88×200 + 73×300 + 58×400)/900 ≈ 69.7. Every school improved by three points. The enrollment weights shifted. Both facts survive arithmetic.

Hit Me...