The Illusion of Change: Correcting for Biases in Change Inference for Sparse, Societal-Scale Data

Societal-scale data is playing an increasingly prominent role in social science research. Research on geopolitical events, for example, asks how emergency events affect the diffusion of information or how new policies change patterns of social interaction. Such research often draws critical inferences from observing how an exogenous event changes meaningful metrics like network degree or network entropy. However, as we show in this work, standard estimation methodologies make systematically incorrect inferences when the event also changes the sparsity of the data. To address this issue, we provide a general framework for inferring changes in social metrics when the sparsity of the data is non-stationary. We propose a plug-in correction that can be applied to any estimator, including several recently proposed procedures. Using both simulated and real data, we demonstrate that the correction significantly improves the accuracy of the estimated change under a variety of plausible data generating processes. In particular, using a large dataset of calls from Afghanistan, we show that whereas traditional methods substantially overestimate the impact of a violent event on social diversity, the plug-in correction reveals the true response to be much more modest.
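The sparsity problem described above can be illustrated with a small simulation. The sketch below is not the paper's framework or its plug-in correction; it only shows, under assumed toy settings, how a naive entropy estimate shifts when the number of observed calls drops while the underlying behavior stays fixed. The uniform contact distribution, the sizes (50 contacts, 200 versus 40 calls), and the helper names `plugin_entropy` and `mean_entropy` are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Naive (maximum-likelihood) Shannon entropy of the samples, in nats."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mean_entropy(rng, num_contacts, num_calls, trials=500):
    """Average naive entropy over repeated draws from a uniform contact distribution."""
    total = 0.0
    for _ in range(trials):
        calls = [rng.randrange(num_contacts) for _ in range(num_calls)]
        total += plugin_entropy(calls)
    return total / trials


rng = random.Random(0)       # fixed seed so the sketch is repeatable
NUM_CONTACTS = 50            # assumed toy value: contacts per individual
true_entropy = math.log(NUM_CONTACTS)  # identical before and after the event

# Same underlying behavior, observed with many calls and then with few calls.
dense = mean_entropy(rng, NUM_CONTACTS, num_calls=200)
sparse = mean_entropy(rng, NUM_CONTACTS, num_calls=40)

# The true change is zero, but the naive estimate reports a drop.
apparent_change = sparse - dense
print(f"true entropy:    {true_entropy:.3f}")
print(f"dense estimate:  {dense:.3f}")
print(f"sparse estimate: {sparse:.3f}")
print(f"apparent change: {apparent_change:.3f}")
```

Because the true entropy is the same in both conditions, any nonzero `apparent_change` here comes entirely from the change in sample size. A comparison of naive estimates before and after an event can therefore report a change in social diversity that did not occur.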
