Big data uncertainties.

Big data, the idea that an ever-larger volume of information is constantly being recorded, suggests that new problems can now be subjected to scientific scrutiny. But can classical statistical methods be applied directly to big data? We analyze the problem by looking at two known pitfalls of big datasets. First, they are biased, in the sense that they do not offer a complete view of the populations under consideration. Second, they present a weak but pervasive level of dependence between all their components. In both cases we observe that the uncertainty of conclusions obtained by statistical methods increases when those methods are applied to big data, either because of a systematic error (bias) or because of a larger degree of randomness (increased variance). We argue that the key challenge raised by big data is not only how to use it to tackle new problems, but also how to develop tools and methods able to rigorously articulate the new risks therein.
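The two pitfalls can be illustrated with a small numerical sketch. It is not the authors' analysis: it assumes a sample mean of n observations with common standard deviation `sigma`, a hypothetical fixed sampling bias `bias`, and a hypothetical equal pairwise correlation `rho` between all observations (the simplest model of weak but pervasive dependence). Under these assumptions the variance of the mean is sigma^2/n * (1 + (n - 1) * rho), so the standard error approaches sigma * sqrt(rho) rather than zero as n grows, and a fixed bias is not reduced by n at all.

```python
import math


def se_independent(sigma, n):
    """Classical standard error of the mean for n independent observations."""
    return sigma / math.sqrt(n)


def se_equicorrelated(sigma, n, rho):
    """Standard error of the mean when every pair of observations has
    correlation rho (illustrative equicorrelation assumption):
    Var(mean) = sigma^2 / n * (1 + (n - 1) * rho)."""
    return sigma * math.sqrt((1 + (n - 1) * rho) / n)


def rmse(bias, se):
    """Root mean squared error: systematic error and random error combined."""
    return math.sqrt(bias ** 2 + se ** 2)


# Hypothetical values chosen only for illustration.
sigma, rho, bias = 1.0, 0.01, 0.05

for n in (100, 10_000, 1_000_000):
    naive = se_independent(sigma, n)
    dependent = se_equicorrelated(sigma, n, rho)
    print(
        f"n={n:>9,}  naive SE={naive:.5f}  "
        f"SE with dependence={dependent:.5f}  "
        f"RMSE with bias={rmse(bias, dependent):.5f}"
    )
```

In this sketch the naive standard error keeps shrinking with n, while the dependence-adjusted standard error levels off near sigma * sqrt(rho) and the RMSE never drops below the bias, which is the sense in which a larger dataset can carry more uncertainty than classical formulas report.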
