Embracing Statistical Challenges in the Information Technology Age

This article examines the role of statistics in the age of information technology (IT). It begins by surveying the current state of IT and of the cyberinfrastructure initiative, which aims to integrate these technologies into science, engineering, and education so that massive amounts of data can be converted into useful information. Selected applications from science and text processing are introduced to provide concrete examples of massive data sets and the statistical challenges that they pose. The thriving field of machine learning is reviewed as an example of current achievements driven by computation and IT. Ongoing challenges that we face in the IT revolution are also highlighted. The article concludes that, for the healthy future of our field, computer technologies have to be integrated into statistics, and statistical thinking in turn must be integrated into computer technologies.
