Relating ensemble diversity and performance: A study in class noise detection

The advantage of ensemble methods over single methods is that they can correct the errors of individual ensemble members and thereby improve overall performance. This paper explores the relation between ensemble diversity and noise detection performance in ensemble-based class noise detection by studying several diversity measures on a range of heterogeneous noise detection ensembles. The empirical analysis covers two ensemble voting schemes: majority and consensus. For ensembles using majority voting, increased diversity does not lead to better noise detection performance and may even degrade it. For consensus-based noise detection ensembles, in contrast, the results show that more diverse ensembles achieve higher precision of class noise detection, whereas less diverse ensembles achieve higher recall and higher F-scores.
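The two voting schemes and the notion of pairwise diversity can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy votes, and the choice of the pairwise disagreement measure computed directly on the detectors' noise flags are all assumptions made for illustration. Each detector is assumed to output one boolean per instance, where True means "flagged as class noise".

```python
from itertools import combinations


def ensemble_noise_votes(votes, scheme="majority"):
    """Combine per-detector noise flags into one flag per instance.

    votes: list of lists, one inner list of booleans per detector
           (hypothetical input format; True = flagged as noisy).
    scheme: "majority" flags an instance if more than half of the
            detectors flag it; "consensus" requires all of them.
    """
    n_detectors = len(votes)
    flagged = []
    for instance_votes in zip(*votes):
        count = sum(instance_votes)
        if scheme == "consensus":
            flagged.append(count == n_detectors)
        elif scheme == "majority":
            flagged.append(count > n_detectors / 2)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return flagged


def mean_pairwise_disagreement(votes):
    """Average fraction of instances on which two detectors differ.

    Assumption: disagreement is measured on the raw noise flags,
    averaged over all detector pairs.
    """
    pairs = list(combinations(votes, 2))
    total = 0.0
    for a, b in pairs:
        total += sum(x != y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


def precision_recall(flagged, truth):
    """Precision and recall of noise detection against known noisy labels."""
    tp = sum(f and t for f, t in zip(flagged, truth))
    precision = tp / sum(flagged) if any(flagged) else 0.0
    recall = tp / sum(truth) if any(truth) else 0.0
    return precision, recall


# Toy data (invented): three detectors, five instances.
votes = [
    [True, True, False, False, True],
    [True, False, False, True, True],
    [True, True, False, False, False],
]
truth = [True, True, False, False, False]  # instances 0 and 1 are truly noisy

for scheme in ("majority", "consensus"):
    flagged = ensemble_noise_votes(votes, scheme)
    print(scheme, flagged, precision_recall(flagged, truth))
print("disagreement:", mean_pairwise_disagreement(votes))
```

On this toy input, consensus voting flags fewer instances than majority voting, so it trades recall for precision. This says nothing about the paper's empirical results, which come from real detectors and data sets.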
