Probabilistic Diagnostic Tests for Degradation Problems in Supervised Learning

Several studies have identified distinct causes of performance degradation in supervised machine learning. Problems such as class imbalance, class overlap, small disjuncts, noisy labels, and sparseness limit the accuracy of classification algorithms. Although a number of approaches, in the form of either a methodology or an algorithm, try to minimize performance degradation, they have been isolated efforts with limited scope. Most of them focus on remediating one problem among many, report experimental results from few datasets and classification algorithms, use insufficient measures of predictive power, and lack statistical validation of the real benefit of the proposed approach. This paper consists of two main parts. The first part presents a novel probabilistic diagnostic model based on identifying the signs and symptoms of each problem. The aim is an early and correct diagnosis of these problems, so that both the most suitable remediation treatment and unbiased performance metrics can be selected. The second part studies the behavior and performance of several supervised algorithms when training sets exhibit such problems. From this, the success of treatments can be predicted across classifiers.
