Statistical Perspectives on “Big Data”

As our information infrastructure evolves, our ability to store, extract, and analyze data is changing rapidly. “Big data” is a popular term for the large, diverse, complex, and/or longitudinal datasets generated from a variety of instruments, sensors, and/or computer-based transactions. The term refers not only to the size or volume of data, but also to the variety of data and the velocity, or speed, of data accrual. As the volume, variety, and velocity of data increase, existing analytical methodologies are stretched to new limits. These changes create new opportunities for researchers in statistical methodology, including those interested in surveillance and statistical process control methods. It is well documented that harnessing big data to make better decisions can serve as a basis for innovative solutions in industry, healthcare, and science; these solutions can be found more easily with sound statistical methodologies. In this paper, we discuss several big data applications to highlight the opportunities and challenges for applied statisticians interested in surveillance and statistical process control. Our goal is to bring the research issues into better focus and to encourage methodological developments for big data analysis in these areas.
