A statistical perspective on data mining

Abstract: Data mining can be regarded as a collection of methods for drawing inferences from data. The aims of data mining, and some of its methods, overlap with those of classical statistics. However, there are some philosophical and methodological differences. We examine these differences, and we describe three approaches to machine learning that have developed largely independently: classical statistics, Vapnik's statistical learning theory, and computational learning theory. Comparing these approaches, we conclude that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them.
