Phenotype Prediction with Semi-supervised Classification Trees

In this work, we address the task of predicting phenotypic traits using semi-supervised learning methods. More specifically, we propose to use supervised and semi-supervised classification trees, as well as supervised and semi-supervised random forests of classification trees. We consider 114 datasets for different phenotypic traits, covering 997 microbial species. These datasets pose a challenge for existing machine learning methods: they are only partially labeled (annotated), and their class distribution is typically imbalanced. We investigate whether treating phenotype prediction as a semi-supervised learning task can improve predictive performance. The results suggest that the semi-supervised methodology considered here helps most when single trees are used, particularly when 20 to 40% of the data is labeled. Similar improvements are observed when the classes (phenotype present versus absent) are highly imbalanced.
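To make concrete how a classification tree might use unlabeled examples, the sketch below scores a tree node by mixing the Gini impurity of the known labels with the variance of the descriptive features over all examples, labeled or not. This is an illustrative assumption, not a formula given in the abstract: the function names, the weight `w`, and the normalisation by root-node values are all hypothetical choices made for this example.

```python
from collections import Counter
from statistics import pvariance


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def ssl_impurity(labels, features, root_gini, root_vars, w=0.5):
    """Hypothetical semi-supervised impurity of one tree node.

    labels    : class label per example, None when the example is unlabeled
    features  : numeric feature vectors, one per example
    root_gini : Gini impurity of the labeled examples at the root (normaliser)
    root_vars : per-feature variance at the root (normalisers)
    w         : weight in [0, 1]; w = 1 ignores the unlabeled examples
    """
    # Supervised part: only examples with a known label contribute.
    known = [y for y in labels if y is not None]
    target_term = gini(known) / root_gini if root_gini > 0 else 0.0

    # Unsupervised part: every example contributes through its features.
    columns = list(zip(*features))
    feature_terms = [
        pvariance(col) / rv if rv > 0 else 0.0
        for col, rv in zip(columns, root_vars)
    ]
    feature_term = sum(feature_terms) / len(feature_terms) if feature_terms else 0.0

    return w * target_term + (1.0 - w) * feature_term


# Toy data: two labeled and two unlabeled examples (made up for illustration).
labels = ["yes", None, "no", None]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

root_gini = gini([y for y in labels if y is not None])
root_vars = [pvariance(col) for col in zip(*features)]

root_score = ssl_impurity(labels, features, root_gini, root_vars)
left_score = ssl_impurity(labels[:2], features[:2], root_gini, root_vars)
```

Under these assumptions, a candidate split is preferred when it lowers this score in the child nodes. In the toy data, the first two examples form a node that is homogeneous in both its one known label and its features, so its score drops to zero even though only one of its examples is labeled.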
