Semi-Supervised Learning Using Hierarchical Mixture Models: Gene Essentiality Case Study

Integrating gene-level data is useful for predicting the role of genes in biological processes. This problem has typically been approached with supervised classification, which requires large training sets of positive and negative examples. However, training data sets that are too small for supervised approaches can still provide valuable information. We describe a hierarchical mixture model that uses a limited set of positively labeled genes as training data for semi-supervised learning. We focus on the problem of predicting essential genes, that is, genes required for the survival of an organism under particular conditions. Using cross-validation, we found that including positively labeled samples in a semi-supervised learning framework with the hierarchical mixture model improves the detection of essential genes compared to unsupervised, supervised, and other semi-supervised approaches. Prediction performance also improved in the setting where genes are incorrectly assumed to be non-essential. Our comparisons indicate that incorporating even small amounts of existing knowledge improves the accuracy of prediction and decreases the variability of predictions. Although we focused on gene essentiality, the hierarchical mixture model and semi-supervised framework are applicable to other problems that involve predicting genes or other features, where multiple data types characterize each feature and only a small set of positive labels is available.
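As a rough illustration of how a small set of positive labels can enter a mixture model fit, the sketch below runs EM on a two-component Gaussian mixture and clamps the labeled genes to the positive component in every E-step. It is not the hierarchical mixture model described above: it is a flat, single-feature simplification, and the function name `semi_supervised_gmm`, the synthetic `scores`, and the choice of `known_pos` are all hypothetical and introduced only for this example.

```python
import numpy as np


def semi_supervised_gmm(x, positive_idx, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture by EM, clamping known positives.

    x            : 1-D array with one score per gene (hypothetical feature)
    positive_idx : indices of genes labeled positive (e.g. known essential)
    Returns the posterior probability of the positive component for every gene.
    """
    x = np.asarray(x, dtype=float)
    labeled = np.zeros(x.size, dtype=bool)
    labeled[positive_idx] = True

    # Initialise: labeled positives seed component 1, all other genes seed component 0.
    mu = np.array([x[~labeled].mean(), x[labeled].mean()])
    sd = np.array([x.std(), x.std()]) + 1e-6
    pi = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: responsibilities of each component under the current parameters.
        dens = np.stack(
            [
                pi[k]
                * np.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2)
                / (sd[k] * np.sqrt(2.0 * np.pi))
                for k in range(2)
            ],
            axis=1,
        )
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # Semi-supervised step: labeled positives are fixed in component 1.
        resp[labeled] = [0.0, 1.0]

        # M-step: weighted updates of mixing proportions, means, and standard deviations.
        nk = resp.sum(axis=0)
        pi = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6

    return resp[:, 1]


# Synthetic example: 200 "non-essential" and 50 "essential" genes,
# of which only 5 essential genes carry a label.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(4.0, 1.0, 50)])
known_pos = [200, 201, 202, 203, 204]
post = semi_supervised_gmm(scores, known_pos)
```

Unlabeled genes are never forced into the negative component here; only the known positives are constrained, which mirrors the positive-label-only setting the abstract describes. A model closer to the one described would add a further level to the mixture to combine multiple data types per gene.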
