Vitali Statistics: Measurability Issues in Education

This weekend, the Olympics drew our attention to those who leave everyone behind, leading us to question the nature of time itself (and I started thinking about algebra). So, I naturally began to think about measurement and education…

Recently, increased attention has been paid to the Obama Administration’s granting of waivers (or, “flexibility”) to states from the provisions of the No Child Left Behind Act of 2001 (NCLB).  The Act has been widely discussed since its passage at the beginning of the century, and I will focus only on one of its provisions (albeit arguably one of its most important).

CYA/Flame Retardant Provision. I readily acknowledge that these topics (both educational reform/performance in general and NCLB in particular) are important, contentious, and complicated.  My point here is to illustrate a specific issue that I believe deserves some thought from those who are considering reform and/or reauthorization of NCLB.

In a nutshell, NCLB requires states to develop standards by which their schools’ and school districts’ performances will be judged. I have a modest goal here: I will point out and try to explain a subtle but classic paradox hidden within one of the ways the NCLB calls upon states to measure educational success.

A key concept in NCLB is Adequate Yearly Progress (AYP).  This concept is measured at the school level for most elementary and high schools.  Without going into even more arcane details, it suffices to know that demonstrating achievement of AYP is desirable. I want to focus on what achieving AYP requires.

Specifically, in each year, tests are administered to students in reading, math, and science.  Waving at some details as we pass them by, success is essentially measured by the percentage of students passing each of these exams.  More importantly for our purposes, success rates must be measured in several ways.  For a given school, the success rates must be sufficiently high (and, generally, improving) in each of the following categories (sketched in code after the list):

  1. all students,
  2. economically disadvantaged students,
  3. students from major racial and ethnic groups,
  4. students with disabilities, and
  5. students with limited English proficiency.
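
For concreteness, here is a toy version of that requirement in Python. This is my own highly simplified sketch, not the statute's actual rule: the real requirements involve targets that rise over time, minimum subgroup sizes, "safe harbor" provisions, and more, all of which I collapse into a single pass-rate threshold (the 75% figure is made up for illustration).

```python
# Hypothetical, highly simplified AYP-style check: every reporting
# category must clear the same pass-rate threshold. (The real rules
# add rising targets, minimum subgroup sizes, "safe harbor," etc.)

def meets_ayp(counts_by_group, threshold=0.75):
    """counts_by_group maps a category name to (passed, enrolled)."""
    return all(passed / enrolled >= threshold
               for passed, enrolled in counts_by_group.values())

print(meets_ayp({
    "all students": (82, 100),               # 82% -- fine on its own
    "economically disadvantaged": (10, 20),  # 50% -- sinks the school
}))  # -> False
```
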
This design immediately raises the possibility of Simpson’s paradox, which can occur when comparing subpopulations with the population as a whole.  In this case, the relevant point is that an unambiguously improving school can still fail to satisfy AYP (and vice-versa).  Here is an example.

Suppose that a school has 100 students in both Years 1 and 2 and, for simplicity, consider only two “subgroups”: economically disadvantaged (“poor”) and not-economically-disadvantaged (“rich”) students.  Suppose that in Year 1, 20 of the school’s students were poor, and that 10 of these students “passed the exam,” whereas 72 of the 80 rich students passed the exam.  The school’s “scores” for Year 1 are then:
Poor: 10/20=50%.
Rich: 72/80=90%.
Total: 82/100=82%.

Now, in Year 2, suppose that 70 of the school’s students are poor, of whom 42 passed the exam, and that all 30 of the rich students passed the exam. The school’s “scores” for Year 2 are then:

Poor: 42/70=60%.
Rich: 30/30=100%.
Total: 72/100=72%.

Uh oh. Viewed from a group-by-group perspective, the school unambiguously improved its performance from Year 1 to Year 2; viewed as a whole, its performance (similarly unambiguously) slipped.
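
If you want to check the arithmetic, here is the example replayed in a few lines of Python (just the numbers from above, nothing new):

```python
# The example's numbers: (passed, total) for each group in each year.
year1 = {"poor": (10, 20), "rich": (72, 80)}
year2 = {"poor": (42, 70), "rich": (30, 30)}

# Each group's pass rate rises from Year 1 to Year 2...
for group in ("poor", "rich"):
    p1, n1 = year1[group]
    p2, n2 = year2[group]
    print(group, p1 / n1, "->", p2 / n2)  # poor: 0.5 -> 0.6; rich: 0.9 -> 1.0

# ...yet the school-wide pass rate falls.
def overall(year):
    return sum(p for p, n in year.values()) / sum(n for p, n in year.values())

print(overall(year1), "->", overall(year2))  # 0.82 -> 0.72
```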

The cause of the “paradox” is that the composition of the school changed between Years 1 and 2.  In Year 2, the school gained students who had a lower success rate (even though, comparing apples to apples, this success rate increased) and lost students who had a higher (and also increased) success rate.  (Note that you can also construct this paradox by altering only the size of one of the groups.)
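
To make that parenthetical concrete, here is one such construction (the Year 2 numbers are mine, invented for illustration): hold the rich group’s size fixed, grow only the poor group, let both groups’ pass rates tick upward, and the school-wide rate still falls.

```python
# Year 1 as in the example; Year 2 alters only the poor group's size
# (20 -> 200) while both groups' pass rates improve slightly.
year1 = {"poor": (10, 20), "rich": (72, 80)}    # 50%, 90%; overall 82%
year2 = {"poor": (110, 200), "rich": (74, 80)}  # 55%, 92.5%; overall ~65.7%

for year in (year1, year2):
    passed = sum(p for p, n in year.values())
    total = sum(n for p, n in year.values())
    print(passed / total)  # 0.82, then ~0.657
```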

The upshot is that the current construction of “Adequate Yearly Progress” may well not measure what some of its proponents think it does.  Put another way, focusing on performance by subgroups (which is probably appropriate in this context and undoubtedly called for by the statute) immediately turns the measurement of progress into an aggregation problem. Aggregation is a (or, perhaps, the) central question of political science.  But rather than get into that, I’ll simply leave you with this other formulation of Simpson’s paradox.

A Couple of Notes…
1. It should also be noted that others (e.g., Aldeman and Liu) have noticed a connection between Simpson’s paradox and educational testing, but I am unaware of anyone who has noticed the direct role of the paradox in the measurement of progress in NCLB.
2. There are several other intriguing measurement aspects of both NCLB and the Obama Administration’s “Race to the Top” program.  Maybe I’ll write about them later.