Mining Functionally Related Genes with Semi-Supervised Learning

The study of biological processes can greatly benefit from tools that automatically predict gene functions or directly cluster genes based on shared functionality. Existing data mining methods predict protein functionality by exploiting data obtained from high-throughput experiments or meta-scale information from public databases. Most existing prediction tools are targeted at predicting protein functions that are described in the gene ontology (GO). However, in many cases biologists wish to discover functionally related genes for which GO terms are inadequate. In this paper, we introduce a rich set of features and use them in conjunction with semisupervised learning approaches in order to expand an initial set of seed genes to a larger cluster of functionally related genes. Among all the semi-supervised methods that were evaluated, the framework of learning with positive and unlabeled examples (LPU) is shown to be especially appropriate for mining functionally related genes. When evaluated on experimentally validated benchmark data, the LPU approaches1 significantly outperform a standard supervised learning algorithm as well as an established state-of-the-art method. Given an initial set of seed genes, our best performing approach could be used to mine functionally related genes in a wide range of organisms.

[1]  Mehmet M. Dalkilic,et al.  Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function , 2009, Genome Biology.

[2]  Igor Jurisica,et al.  Functional topology in a network of protein interactions , 2004, Bioinform..

[3]  Mikhail Belkin,et al.  Laplacian Support Vector Machines Trained in the Primal , 2009, J. Mach. Learn. Res..

[4]  Soumen Roy,et al.  Systems biology beyond degree, hubs and scale-free networks: the case for multiple metrics in complex networks , 2012, Systems and Synthetic Biology.

[5]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[6]  Erik L. L. Sonnhammer,et al.  InParanoid 7: new algorithms and tools for eukaryotic orthology analysis , 2009, Nucleic Acids Res..

[7]  M. S. Mukhtar,et al.  Independently Evolved Virulence Effectors Converge onto Hubs in a Plant Immune System Network , 2011, Science.

[8]  Philip S. Yu,et al.  Building text classifiers using positive and unlabeled examples , 2003, Third IEEE International Conference on Data Mining.

[9]  Xin Chen,et al.  PAIR: the predicted Arabidopsis interactome resource , 2010, Nucleic Acids Res..

[10]  Jungwon Yoon,et al.  The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community , 2003, Nucleic Acids Res..

[11]  John Z. Kiss,et al.  Plant tropisms: from Darwin to the International Space Station. , 2013, American journal of botany.

[12]  Kengo Kinoshita,et al.  ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis , 2006, Nucleic Acids Res..

[13]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[14]  C. Stoeckert,et al.  OrthoMCL: identification of ortholog groups for eukaryotic genomes. , 2003, Genome research.

[15]  Kara Dolinski,et al.  The BioGRID Interaction Database: 2011 update , 2010, Nucleic Acids Res..

[16]  Juwen Shen,et al.  Predicting protein–protein interactions based only on sequences information , 2007, Proceedings of the National Academy of Sciences.

[17]  Philip S. Yu,et al.  G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery , 2009, Nucleic Acids Res..

[18]  Rafael C. Jimenez,et al.  The IntAct molecular interaction database in 2012 , 2011, Nucleic Acids Res..

[19]  John Quackenbush,et al.  Seeded Bayesian Networks: Constructing genetic networks from microarray data , 2008, BMC Systems Biology.

[20]  Nello Cristianini,et al.  Kernel-Based Data Fusion and Its Application to Protein Function Prediction in Yeast , 2003, Pacific Symposium on Biocomputing.

[21]  Bernhard Schölkopf,et al.  Fast protein classification with multiple networks , 2005, ECCB/JBI.

[22]  Bing Liu,et al.  Learning with Positive and Unlabeled Examples Using Weighted Logistic Regression , 2003, ICML.

[23]  Igor Jurisica,et al.  Inferring the functions of longevity genes with modular subnetwork biomarkers of Caenorhabditis elegans aging , 2010, Genome Biology.

[24]  Xing-Ming Zhao,et al.  Gene function prediction using labeled and unlabeled data , 2008, BMC Bioinformatics.

[25]  S. Rhee,et al.  AraCyc: A Biochemical Pathway Database for Arabidopsis1 , 2003, Plant Physiology.

[26]  Bernhard Schölkopf,et al.  Use of the Zero-Norm with Linear Models and Kernel Methods , 2003, J. Mach. Learn. Res..

[27]  Jonathan D. Wren,et al.  A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide , 2009, Bioinform..

[28]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[29]  Martin Schindler,et al.  AthaMap: From in silico Data to Real Transcription Factor Binding Sites , 2006, Silico Biol..

[30]  Suzanna Lewis,et al.  Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium , 2011, Briefings Bioinform..

[31]  Daniel W. A. Buchan,et al.  A large-scale evaluation of computational protein function prediction , 2013, Nature Methods.

[32]  Pedro Larrañaga,et al.  Learning Bayesian classifiers from positive and unlabeled examples , 2007, Pattern Recognit. Lett..

[33]  Charles Elkan,et al.  Learning classifiers from only positive and unlabeled data , 2008, KDD.

[34]  David Warde-Farley,et al.  GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function , 2008, Genome Biology.

[35]  Michael I. Jordan,et al.  A critical assessment of Mus musculus gene function prediction using integrated genomic evidence , 2008, Genome Biology.

[36]  Rémi Gilleron,et al.  Text Classification from Positive and Unlabeled Examples , 2002 .

[37]  Mark Gerstein,et al.  Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique , 2010, BMC Bioinformatics.

[38]  J. R. Green,et al.  Global investigation of protein–protein interactions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide sequences , 2008, Nucleic acids research.

[39]  Razvan C. Bunescu,et al.  Subsequence Kernels for Relation Extraction , 2005, NIPS.

[40]  Jean-Philippe Vert,et al.  Supervised inference of gene regulatory networks from positive and unlabeled examples. , 2013, Methods in molecular biology.

[41]  Hans-Michael Müller,et al.  Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature , 2004, PLoS biology.

[42]  Saranyan K. Palaniswamy,et al.  AGRIS and AtRegNet. A Platform to Link cis-Regulatory Elements and Transcription Factors into Regulatory Networks1[W][OA] , 2006, Plant Physiology.