The Bigger The Data, The Harder The (Theory of) Measurement

We now live in a world of seemingly never-ending “data” and, relatedly, one of ever-cheaper computational resources.  This has led to lots of really cool topics being (re)discovered.  Text analysis, genetics, fMRI brain scans, (social and anti-social) networks, campaign finance data… these are all areas of analysis that, practically speaking, were “doubly impossible” ten years ago: neither the data nor the computational power to analyze them really existed in practical terms.

Big data is awesome…because it’s BIG.  I’m not going to weigh in on the debate about which dimension “bigness” should be judged on (is it the size of the data set or the size of the phenomena it describes?).  Rather, I just want to point out that big data—even more than “small” data—require data reduction prior to analysis with standard (e.g., correlation/regression) techniques.  More generally, theories (and, accordingly, results or “findings”) are useful only to the extent that they are portable and explicable, each of which generally necessitates some sort of data reduction.  For example, a (good) theory of weather is never ignorant of geography, but a truly useful theory of weather is capable of producing findings (and hence being analyzed) in the absence of GPS data: a useful theory of weather needs to be at least mostly location-independent.  The same is true of social science: a useful theory’s predictions should be largely, if not completely, independent of the identities of the actors involved.  It’s not useful to have a theory of conflict that requires one to specify every aspect of the conflict prior to producing a prediction and/or prescription.

Data reduction is aggregation.  That is, data reduction takes big things and makes them small by (colloquially) “adding up/combining” the details into a smaller (and necessarily less-than-completely-precise) representation of the original.
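To make the aggregation point concrete, here is a toy sketch (my own illustration, not from the piece): a network’s full adjacency structure gets reduced to a per-node degree count—a smaller, necessarily less precise summary that discards the who-ties-to-whom detail.

```python
# Toy illustration: data reduction as aggregation.
# The full adjacency structure (every tie) is collapsed into
# a single number per node -- its degree.

adjacency = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"C"},
}

# Aggregate: "add up" the ties, losing the detail of which ties exist.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}

print(degree)  # {'A': 2, 'B': 1, 'C': 2, 'D': 1}
```

The reduced representation (four numbers) answers some questions about the original (who is most connected?) while making others unanswerable (is A tied to D?)—exactly the trade-off aggregation imposes.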

Maggie Penn and I have recently written a short piece, tentatively titled “Analyzing Big Data: Social Choice & Measurement,” to hopefully be included in a symposium on “Big Data, Causal Inference, and Formal Theory” (or something like that), coordinated by Matt Golder.[1]

In a nutshell, our argument in the piece is that characterizing and judging data reduction is a subset of social choice theory.  Practically, then, we argue that the empirical and logistical difficulties with trying to characterize the properties/behaviors of various empirical approaches to dealing with “big data” suggest the value of the often-overlooked “axiomatic” approaches that form the heart of social choice theory.  We provide some examples from network analysis to illustrate our points.
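As a hypothetical illustration of what an axiomatic check can look like in the network setting (my own example, not one from the paper): one can verify by brute force that a simple measure like node degree satisfies an “anonymity” axiom—relabeling the nodes permutes the scores in exactly the same way, so the measure is independent of node identities.

```python
import itertools

# Toy axiomatic check: does degree satisfy "anonymity"?
# Anonymity: relabeling the nodes permutes the scores identically,
# i.e., the measure does not depend on who the nodes are.

def degree(adjacency):
    return {node: len(nbrs) for node, nbrs in adjacency.items()}

def relabel(adjacency, mapping):
    return {mapping[n]: {mapping[m] for m in nbrs}
            for n, nbrs in adjacency.items()}

adjacency = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"C"},
}

nodes = sorted(adjacency)
scores = degree(adjacency)

# Check every possible relabeling of the four nodes.
for perm in itertools.permutations(nodes):
    mapping = dict(zip(nodes, perm))
    relabeled_scores = degree(relabel(adjacency, mapping))
    # Node n's original score must equal mapping[n]'s score afterward.
    assert all(relabeled_scores[mapping[n]] == scores[n] for n in nodes)

print("anonymity holds for degree on this graph")
```

The axiomatic approach proper works in the other direction—deriving which measures are characterized by a set of properties—but even this brute-force check shows the flavor: properties of a data-reduction rule can be stated and verified independently of any particular data set.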

Anyway, I throw this out there to provoke discussion as well as to troll for feedback: we’re very interested in complaints, criticisms, and suggestions.[2]  Feel free to either comment here or email me at

With that, I leave you with this.

[1] The symposium came out of a roundtable that I had the pleasure of being part of at the Midwest Political Science Association meetings (which was surprisingly well-attended—you can see the top of my coiffure in the upper left corner of this picture).

[2] I’m also always interested in compliments.