Handbook of Big Data

Handbook of Big Data provides a state-of-the-art overview of the analysis of large-scale datasets. Featuring contributions from well-known experts in statistics and computer science, this handbook presents a carefully curated collection of techniques from both industry and academia. Thus, the text instills a working understanding of key statistical and computing ideas that can be readily applied in research and practice. Offering balanced coverage of methodology, theory, and applications, this handbook describes modern, scalable approaches for analyzing increasingly large datasets; defines the underlying concepts of the available analytical tools and techniques; and details intercommunity advances in computational statistics and machine learning. Handbook of Big Data also identifies areas in need of further development, encouraging greater communication and collaboration between researchers in big data sub-specialties such as genomics, computational biology, and finance.
