Probabilistic combination of classification rules and its application to medical diagnosis

Applying machine learning to medical diagnosis raises two major issues: the need to learn comprehensible models and the need to cope with imbalanced data. The first concerns learning interpretable models, e.g., classification rules or decision trees. The second arises when the number of examples from one class (e.g., healthy patients) is significantly higher than the number of examples from the other class (e.g., ill patients); learning algorithms that are sensitive to class imbalance return models biased towards the majority class. In this paper, we propose a probabilistic combination of soft rules, which can be seen as a probabilistic version of classification rules, by introducing a new latent random variable called a conjunctive feature. Conjunctive features represent conjunctions of values of attribute variables (features), and we assume that, given a conjunctive feature, the object and its label (class) are independent random variables. To deal with the between-class imbalance problem, we present a new estimator that incorporates knowledge about the data imbalance into the hyperparameters of the initial (prior) probability of objects with fixed class labels. Additionally, we propose a method for aggregating the sufficient statistics needed to estimate these probabilities in a graph-based structure, which speeds up computation. Finally, we carry out two sets of experiments, one on benchmark datasets and one on medical datasets, discuss the results, and draw conclusions.
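To make the two modeling ideas in the abstract concrete, here is a minimal LaTeX sketch; the symbols x (object), y (class label), c (conjunctive feature), the counts n, and the hyperparameters alpha_y are assumed notation for illustration, not taken from the paper.

% Sketch only: x = object, y = class label, c = latent conjunctive feature;
% all notation below is assumed, not taken from the paper.

% Conditional independence of object and label given the conjunctive feature:
\[
  p(\mathbf{x}, y \mid c) = p(\mathbf{x} \mid c)\, p(y \mid c),
\]
% which turns the combination of soft rules into a mixture over
% conjunctive features:
\[
  p(y \mid \mathbf{x}) = \sum_{c} p(y \mid c)\, p(c \mid \mathbf{x}).
\]
% A plausible form of an imbalance-aware estimator: Dirichlet-style
% smoothing in which the hyperparameter \alpha_y encodes the class
% imbalance (e.g., larger for the minority class):
\[
  \hat{p}(y \mid c) = \frac{n_{y,c} + \alpha_y}{n_{c} + \sum_{y'} \alpha_{y'}},
\]
% where n_{y,c} counts training objects of class y covered by conjunctive
% feature c, and n_c = \sum_y n_{y,c}.

Under this reading, the estimator reduces to a Laplace-style smoothed count when all alpha_y are equal, and shifts probability mass towards the minority class when its hyperparameter is inflated.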
