A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness

Classification algorithms are the most commonly used data mining models, applied widely to extract valuable knowledge from huge amounts of data. Classifiers are mostly evaluated on accuracy, computational complexity, robustness, scalability, integration, comprehensibility, stability, and interestingness. This study compares classification algorithms in terms of accuracy, speed (CPU time consumed) and robustness across various datasets and implementation techniques. The data miner selects a model mainly on the basis of classification accuracy; the performance of each classifier therefore plays a crucial role in selection. Complexity is dominated mostly by the time required for classification, so complexity here refers to the CPU time consumed by each classifier. The study first discusses the application of certain classification models to multiple datasets in three stages: first, implementing the algorithms on the original datasets; second, implementing the algorithms on the same datasets with continuous variables discretised; and third, implementing the algorithms on the same datasets after principal component analysis is applied. The resulting accuracies and speeds are then compared. The relationship between dataset characteristics and implementation attributes on the one hand, and accuracy and CPU time on the other, is also examined and discussed. Moreover, a regression model is introduced to show how dataset and implementation conditions correlate with classifier accuracy and CPU time. Finally, the study addresses the robustness of the classifiers, measured by repeated experiments on both noisy and cleaned datasets.
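The kind of comparison described above can be illustrated with a minimal sketch that records accuracy and CPU time for one classifier under different data conditions. Everything in it is an assumption made for illustration and is not taken from the study: the synthetic two-class dataset, the nearest-centroid classifier (a stand-in, not one of the study's algorithms), equal-width binning as the discretisation method, label flipping as the noise model, and all function names. The principal component analysis stage is omitted to keep the sketch within the Python standard library.

```python
import random
import time


def make_data(n, seed):
    """Hypothetical stand-in dataset: two Gaussian blobs in 2-D."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        centre = 0.0 if label == 0 else 3.0
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


def discretise(rows, bins=5):
    """Equal-width binning of every continuous feature (assumed method)."""
    dims = len(rows[0][0])
    lo = [min(x[d] for x, _ in rows) for d in range(dims)]
    hi = [max(x[d] for x, _ in rows) for d in range(dims)]
    out = []
    for x, y in rows:
        binned = [
            min(bins - 1, int((x[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0) * bins))
            for d in range(dims)
        ]
        out.append((binned, y))
    return out


def flip_labels(rows, rate, seed):
    """Inject label noise by flipping each binary label with probability `rate`."""
    rng = random.Random(seed)
    return [(x, 1 - y if rng.random() < rate else y) for x, y in rows]


def fit_centroids(train):
    """Compute the per-class mean vector of the training rows."""
    sums, counts = {}, {}
    for x, y in train:
        s = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            s[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def evaluate(train, test):
    """Return (accuracy, CPU seconds) for training plus classification."""
    start = time.process_time()
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, x) == y for x, y in test)
    cpu = time.process_time() - start
    return correct / len(test), cpu


def run_stages(n=400, seed=0):
    """Evaluate the same classifier on original, discretised and noisy data."""
    data = make_data(n, seed)
    binned = discretise(data)
    cut = int(0.7 * n)  # 70/30 train/test split
    return {
        "original": evaluate(data[:cut], data[cut:]),
        "discretised": evaluate(binned[:cut], binned[cut:]),
        "noisy_labels": evaluate(flip_labels(data[:cut], 0.2, seed + 1), data[cut:]),
    }


results = run_stages()
for stage, (accuracy, cpu_seconds) in results.items():
    print(f"{stage}: accuracy={accuracy:.3f}, cpu={cpu_seconds:.6f}s")
```

CPU time is read with `time.process_time()` rather than wall-clock time so that the measurement reflects processor use by the classifier and not other activity on the machine. On a dataset this small the timings are close to the timer resolution, so a real comparison would repeat each run and average the results.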
