Top 10 algorithms in data mining

This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These algorithms are among the most influential in the data mining research community. For each one, we describe the algorithm, discuss its impact, and review current and further research on it. Together, the 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, all of which are among the most important topics in data mining research and development.
