Tasks and methods of Big Data analysis (a survey)

We review the tasks and methods most relevant to Big Data analysis. The emphasis is on the conceptual and pragmatic issues of these tasks and methods, avoiding unnecessary mathematical detail. We suggest that all work with Big Data falls into four conceptual modes (types) of large-scale usage: 1) intelligent information retrieval; 2) massive (large-scale) conveyed data processing (mining); 3) model inference from data; 4) knowledge extraction from data (detection of regularities and discovery of structures). The essence of various tasks (clustering, regression, generative model inference, structure discovery, etc.) is elucidated. We compare key methods of clustering, regression, classification, deep learning, generative model inference and causal discovery. Cluster analysis may be divided into methods based on mean distance, methods based on local distance, and methods based on a model. The targeted (predictive) methods fall into two categories: methods that infer a model, and "tied to data" methods that compute a prediction directly from the data. Common tasks of temporal data analysis are briefly overviewed. Among the diverse methods of generative model inference we focus on causal network learning, because models of this class are very expressive and flexible and are able to predict the effects of interventions under varying conditions. The independence-based approach to causal network inference from data is characterized. We give a few comments on the specifics of inferring dynamical causal networks from time series. Challenges of Big Data analysis raised by data multidimensionality, heterogeneity and huge volume are presented, and some statistical issues related to these challenges are summarized. Problems in programming 2019; 3: 58-85
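The distinction drawn above between predictive methods that infer a model and "tied to data" methods can be illustrated with a minimal sketch. It is not taken from the survey: the choice of least-squares line fitting for the first category and k-nearest-neighbour averaging for the second, the function names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Model-inferring method: estimate slope a and intercept b of y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def predict_line(model, x):
    """Predict from the inferred parameters only; the data are no longer needed."""
    a, b = model
    return a * x + b


def predict_knn(xs, ys, x, k=2):
    """'Tied to data' method: average the targets of the k nearest stored points."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
    return mean(y for _, y in nearest)


# Hypothetical toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

model = fit_line(xs, ys)
line_inside = predict_line(model, 2.5)       # query inside the data range
knn_inside = predict_knn(xs, ys, 2.5, k=2)
line_outside = predict_line(model, 10.0)     # query far outside the data range
knn_outside = predict_knn(xs, ys, 10.0, k=1)
```

Inside the range of the data the two approaches agree on this toy set. Outside it, the fitted line extrapolates, while the nearest-neighbour prediction stays at the target of the closest stored point, which is one practical consequence of computing predictions directly from the data.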
