Evaluation Measures for Multi-class Subgroup Discovery

Subgroup discovery aims to find subsets of a population whose class distribution differs significantly from the overall distribution. It has predominantly been investigated in a two-class context. This paper investigates multi-class subgroup discovery methods. We consider six evaluation measures for multi-class subgroups, four of them new, and study their theoretical properties. We extend the two-class subgroup discovery algorithm CN2-SD to incorporate the new evaluation measures and a new weighting scheme inspired by AdaBoost. We demonstrate the usefulness of multi-class subgroup discovery experimentally, using the discovered subgroups as features for a decision tree learner. Not only is the number of leaves of the decision tree reduced by a factor of 8 to 16 on average, but significant improvements in accuracy and AUC are achieved with particular evaluation measures and settings. Similar performance improvements are observed with naive Bayes.
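To make the kind of measure studied here concrete, the sketch below implements one natural multi-class generalisation of weighted relative accuracy (WRAcc): the coverage-weighted average of the absolute per-class deviations between the subgroup's class distribution and the overall distribution. The function name and the exact averaging scheme are illustrative assumptions, not necessarily the definitions used in the paper.

```python
import numpy as np

def multiclass_wracc(y, covered):
    """Illustrative multi-class weighted relative accuracy.

    Two-class WRAcc is p(S) * (p(+|S) - p(+)). One natural multi-class
    generalisation averages the absolute per-class deviation between the
    class distribution inside the subgroup and the overall distribution,
    weighted by subgroup coverage. (Assumed form for illustration only.)

    y       : array of class labels for all examples
    covered : boolean mask, True where the subgroup covers the example
    """
    classes = np.unique(y)
    coverage = covered.mean()                    # p(S)
    if coverage == 0.0:
        return 0.0
    score = 0.0
    for c in classes:
        p_c = (y == c).mean()                    # p(c) in the population
        p_c_given_s = (y[covered] == c).mean()   # p(c|S) in the subgroup
        score += abs(p_c_given_s - p_c)
    return coverage * score / len(classes)

# A subgroup concentrated on one class scores higher than a subgroup
# whose class distribution mirrors the population's.
y = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
pure = np.array([True, True, True] + [False] * 7)
print(multiclass_wracc(y, pure))  # ~0.14
```

The AdaBoost-inspired weighting scheme can be sketched in the same spirit: in CN2-SD's weighted covering, examples covered by a newly learned rule have their weights reduced so that subsequent rules focus on examples not yet well covered. A minimal sketch of an exponential, AdaBoost-style down-weighting follows; the exact update rule and the alpha parameter are assumptions, not the paper's definition.

```python
import numpy as np

def reweight_adaboost_style(weights, covered, correct, alpha=0.5):
    """Down-weight examples correctly covered by the new rule, in the
    style of AdaBoost's exponential update, then renormalise.
    (Assumed update rule for illustration only.)

    weights : float array of current example weights
    covered : boolean mask of examples covered by the new rule
    correct : boolean mask of examples the rule classifies correctly
    """
    w = weights.copy()
    w[covered & correct] *= np.exp(-alpha)  # focus later rules elsewhere
    return w / w.sum()
```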
