The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

Many classification studies conclude with a summary table that reports the performance of various data mining approaches on different datasets. No single method outperforms all others all the time. Furthermore, the performance of a classification method in terms of its false-positive and false-negative rates may be highly unpredictable. Attempts to minimize either of these two rates may increase the other. If the model allows new data to be deemed unclassifiable when there is not adequate information to classify them, then both of these error rates can be very low while, at the same time, the rate of unclassifiable new examples is very high. The root of this critical problem is the overfitting and overgeneralization behavior of a given classification approach when it processes a particular dataset. Although this situation is of fundamental importance to data mining, it has not been studied from a comprehensive point of view. This chapter therefore analyzes these issues in depth. It also proposes a new approach, called the Homogeneity-Based Algorithm (HBA), for optimally controlling the three error rates: false positives, false negatives, and unclassifiable examples. This is done by first formulating an optimization problem. The key development in this chapter is a special way of analyzing the space of the training data and then partitioning it according to the data density of its different regions. The classification task is then pursued on the basis of this partitioning of the training space. In this way, the three error rates can be controlled in a comprehensive manner. Preliminary computational results seem to indicate that the proposed approach has significant potential to fill a critical gap in current data mining methodologies.
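To make the three error rates concrete, the following minimal Python sketch computes them for a binary classifier that may decline to classify an example. It is an illustration only and not the HBA itself: the function names, the use of `None` to mark an unclassifiable example, the choice of the total number of test examples as the denominator, and the weighted-sum cost with its penalty values are all assumptions made here, since the abstract does not state the form of the optimization problem.

```python
from typing import Optional, Sequence, Tuple


def three_error_rates(
    y_true: Sequence[int],
    y_pred: Sequence[Optional[int]],
) -> Tuple[float, float, float]:
    """Return (false-positive, false-negative, unclassifiable) rates.

    Labels are 1 (positive) or 0 (negative). A prediction of None means the
    model declined to classify the example (an assumed convention).
    Each rate is a fraction of all test examples (also an assumption).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    if n == 0:
        raise ValueError("at least one example is required")

    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    uc = sum(1 for p in y_pred if p is None)
    return fp / n, fn / n, uc / n


def total_cost(
    rates: Tuple[float, float, float],
    c_fp: float,
    c_fn: float,
    c_uc: float,
) -> float:
    """Hypothetical objective: a weighted sum of the three rates."""
    rate_fp, rate_fn, rate_uc = rates
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc


# Toy data: eight test examples, two of which the model refuses to classify.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, None, None, 1, 0]

rates = three_error_rates(y_true, y_pred)
# Illustrative penalties only: a missed positive is assumed to cost the most.
cost = total_cost(rates, c_fp=1.0, c_fn=20.0, c_uc=3.0)
print(rates, cost)
```

Under these assumptions, a model can drive the first two rates toward zero simply by refusing to classify more examples, which is why the third rate has to appear in whatever objective is minimized.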
