Interpretable Ensembles of Classifiers for Uncertain Data With Bioinformatics Applications

Data uncertainty remains a challenging issue in many applications, but few classification algorithms can cope with it effectively. An ensemble approach for uncertain categorical features has recently been proposed, achieving promising results. It biases the sampling of features for each model in an ensemble so that less uncertain features are more likely to be sampled. Here we extend this idea of biased sampling and propose two new approaches: one for selecting training instances for each model in an ensemble, and another for sampling the features considered when splitting a node during Random Forest training. We applied these approaches to classify ageing-related genes and to predict drugs' side effects, based on uncertain features representing protein-protein and protein-chemical interactions. We show that ensembles based on our proposed approaches achieve better predictive performance. In particular, our proposed approaches improved the performance of a Random Forest based on the most sophisticated approach for handling uncertain data in ensembles of this kind. Furthermore, we propose two new approaches for interpreting an ensemble of Naive Bayes classifiers and analyse their results on our datasets of ageing-related genes and drugs' side effects.
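The biased-sampling idea can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the abstract does not state how uncertainty is quantified or how the sampling probabilities are defined, so the sketch assumes that each feature (or instance) has a strictly positive confidence score in (0, 1] and that sampling weights are proportional to that score. The function names `biased_feature_sample` and `biased_instance_sample` and all the scores are hypothetical.

```python
import random


def biased_feature_sample(confidence, k, rng):
    """Draw k distinct feature indices without replacement.

    Assumption: confidence[j] > 0 is a confidence score for feature j,
    and the chance of drawing j is proportional to it, so less
    uncertain features are more likely to be sampled.
    Requires k <= len(confidence).
    """
    remaining = list(range(len(confidence)))
    chosen = []
    for _ in range(k):
        weights = [confidence[j] for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # keep the sampled features distinct
    return chosen


def biased_instance_sample(instance_confidence, n, rng):
    """Draw n instance indices with replacement (bootstrap-style),
    weighted towards instances with higher (assumed) confidence."""
    indices = range(len(instance_confidence))
    return rng.choices(indices, weights=instance_confidence, k=n)


rng = random.Random(0)

# Hypothetical confidence scores for four features.
feature_conf = [0.9, 0.8, 0.2, 0.1]

# Count how often each feature is selected over many ensemble members.
counts = [0] * len(feature_conf)
for _ in range(2000):
    for j in biased_feature_sample(feature_conf, 2, rng):
        counts[j] += 1

# Hypothetical confidence scores for five training instances.
instance_conf = [1.0, 0.9, 0.5, 0.3, 0.1]
bootstrap = biased_instance_sample(instance_conf, 5, rng)

print(counts)
print(bootstrap)
```

In an ensemble, `biased_feature_sample` would be called once per base model to choose its feature subset, or once per node to choose the candidate split features in a Random Forest, and `biased_instance_sample` would be called once per base model to choose its training instances. Weights proportional to confidence are only one possible choice; any monotone function of confidence gives the same qualitative bias.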