A Survey of Statistical Methods and Computing for Big Data

Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard software tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized, even by peer statisticians. This article reviews recent methodological and software developments in statistics that address big data challenges. The methodologies are grouped into three classes: subsampling-based methods, divide-and-conquer methods, and sequential updating for stream data. The software review focuses on open-source R and its packages, covering recent tools that help overcome the limits of computer memory and computing power. Some of the tools are illustrated in a case study that fits a logistic regression for the chance of airline delay.
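
As a rough illustration of the third class, sequential updating for stream data, the sketch below fits a simple least-squares line by keeping only running sums, so each chunk can be discarded once it has been processed. It is a minimal sketch under stated assumptions, not the method of any paper reviewed here: it uses linear rather than logistic regression (the logistic case needs iterative updates), it is written in Python rather than R, and the names `StreamingSimpleRegression` and `fit_in_chunks` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamingSimpleRegression:
    """Running sufficient statistics for the least-squares fit of y = a + b*x.

    Hypothetical illustration: only five numbers are kept in memory,
    regardless of how many observations have been seen.
    """
    n: int = 0
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x*x
    sxy: float = 0.0  # sum of x*y

    def update(self, xs, ys):
        """Fold one chunk into the running sums; the chunk is not stored."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (intercept, slope) computed from the accumulated sums."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def fit_in_chunks(chunks):
    """Fit the line from an iterable of (xs, ys) chunks, one chunk at a time."""
    model = StreamingSimpleRegression()
    for xs, ys in chunks:
        model.update(xs, ys)
    return model.coefficients()
```

Because the sums are additive, the same accumulators could also be computed separately on disjoint blocks and then added together, which is the basic idea behind divide-and-conquer for this simple model.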
