Challenges of Big Data Analysis.

Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This article gives an overview of the salient features of Big Data and of how these features drive paradigm change in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated because of incidental endogeneity. Relying on them can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
