Big Data in context and robustness against heterogeneity

The phrase "Big Data" has generated substantial discussion both within and outside the field of statistics. Some personal observations about this phenomenon are offered. One contribution is to place these ideas in a larger historical context. Another is to highlight the related and important concept of robustness against data heterogeneity, to recall some earlier methods that had this property, and to discuss a number of interesting open problems motivated by the concept.
