Integrative approaches to the prediction of protein functions based on the feature selection

Background: Protein function prediction is one of the most important problems in functional genomics. With various genomic data sets now available, many researchers have developed integration models that combine all available genomic data for protein function prediction. These efforts have improved prediction quality and extended prediction coverage. However, integrating more data sources does not always increase prediction quality. Selecting the data sources that contribute most to protein function prediction has therefore become an important issue.

Results: We present systematic feature selection methods that assess the contribution of genome-wide data sets to protein function prediction, and we investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including protein domains, protein-protein interactions, gene expression, phenotype ontology, phylogenetic profiles, and disease data, to predict protein functions labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using kernel-based logistic regression (KLR), and kernel-based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set to each function based on its prediction quality. In the second approach, we use the estimated feature coefficients as measures of the contribution of each data source. Our results show that the proposed methods improve prediction quality compared with full integration of all data sources and with other filter-based feature selection methods. We also show that the contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among a group of protein functions that share the same parent in the GO hierarchy.

Conclusions: In contrast to previous integration methods, our approaches not only increase prediction quality but also provide information about the highly contributing data sources for each protein function. This information can help researchers collect relevant data sources for annotating protein functions.
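The idea behind the second approach, reading the contribution of each data source from the coefficients of an L1-regularized logistic regression, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the names `source_A`, `source_B` and `source_C` are hypothetical, each source is reduced to a single score per protein instead of a kernel, and the solver is a plain proximal gradient loop, which is not necessarily the solver used in the study.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, step=0.5, iters=500):
    """Fit L1-regularized logistic regression by proximal gradient descent.

    X is a list of rows (one score per data source), y a list of 0/1 labels.
    lam, step and iters are illustrative defaults, not values from the paper.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b  # the intercept is not penalized
        # Gradient step on the loss, then shrinkage from the L1 penalty.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


def rank_sources(names, weights):
    """Order data sources by the absolute value of their coefficient."""
    return sorted(zip(names, weights), key=lambda p: abs(p[1]), reverse=True)


# Synthetic toy data (an assumption): source_A carries signal about the
# label, while source_B and source_C are pure noise.
rng = random.Random(0)
names = ["source_A", "source_B", "source_C"]
y = [i % 2 for i in range(200)]
X = [[(1.0 if yi else -1.0) + rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)]
     for yi in y]

weights, intercept = fit_l1_logistic(X, y)
ranking = rank_sources(names, weights)
print(ranking)
```

In this toy setting the informative source receives the largest coefficient, while the penalty shrinks the coefficients of the noise sources toward zero. Running the same fit once per GO term would give one ranking of sources per function, which is the kind of per-function information the abstract describes.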
