Data Mining at the Interface of Computer Science and Statistics

This chapter is written for computer scientists, engineers, mathematicians, and scientists who wish to gain a better understanding of the role of statistical thinking in modern data mining. Data mining has attracted considerable attention in both the research and commercial arenas in recent years, and it draws on a variety of techniques from both computer science and statistics. The chapter discusses how computer scientists and statisticians approach data from different but complementary viewpoints and highlights the fundamental differences between the statistical and computational views of data mining. In doing so, we review the historical importance of statistical contributions to machine learning and data mining, including neural networks, graphical models, and flexible predictive modeling. The primary conclusion is that closer integration of computational methods with statistical thinking is likely to become increasingly important in data mining applications.
