Computational social scientist beware: Simpson’s paradox in behavioral data

Observational data about human behavior are often heterogeneous, i.e., generated by subgroups within the population under study that vary in size and behavior. Heterogeneity predisposes analysis to Simpson’s paradox, whereby trends observed in data aggregated over the entire population may differ substantially from the trends within the underlying subgroups. I illustrate Simpson’s paradox with several examples from studies of online behavior and show that the aggregate response leads to wrong conclusions about the underlying individual behavior. I then present a simple method to test whether Simpson’s paradox is affecting the results of an analysis. The presence of Simpson’s paradox in social data suggests that important behavioral differences exist within the population, and that failing to take these differences into account can distort a study’s findings.
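
The effect described above can be made concrete with a minimal sketch. The numbers, the group labels "A" and "B", and the helper `slope` are all hypothetical and chosen only for illustration; they are not data from the studies discussed, and this is not the testing method the paper presents. The sketch only shows how a trend fitted to pooled data can point the opposite way from the trend inside every subgroup.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical subgroups: within each group, y falls as x rises,
# but group "B" has both larger x values and a higher baseline y.
groups = {
    "A": ([1, 2, 3, 4], [10, 9, 8, 7]),
    "B": ([6, 7, 8, 9], [20, 19, 18, 17]),
}

# Trend within each subgroup (negative in both).
subgroup_slopes = {name: slope(xs, ys) for name, (xs, ys) in groups.items()}

# Trend after pooling all observations and ignoring group membership.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
aggregate_slope = slope(all_x, all_y)

# Flag a sign reversal between the pooled trend and every subgroup trend.
reversal = all(s * aggregate_slope < 0 for s in subgroup_slopes.values())

print("subgroup slopes:", subgroup_slopes)
print("aggregate slope:", aggregate_slope)
print("trend reverses on aggregation:", reversal)
```

Here the pooled slope is positive only because group membership is correlated with both variables; once the data are split by group, the decreasing trend within each group becomes visible.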
