An en masse phenotype and function prediction system for Mus musculus

Background: Individual researchers are struggling to keep up with the accelerating emergence of high-throughput biological data, and to extract information that relates to their specific questions. Integration of accumulated evidence should permit researchers to form fewer, and more accurate, hypotheses for further study through experimentation.

Results: Here a method previously used to predict Gene Ontology (GO) terms for Saccharomyces cerevisiae (Tian et al.: Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function. Genome Biol 2008, 9(Suppl 1):S7) is applied to predict GO terms and phenotypes for 21,603 Mus musculus genes, using a diverse collection of integrated data sources (including expression, interaction, and sequence-based data). This combined 'guilt-by-profiling' and 'guilt-by-association' approach optimizes the combination of two inference methodologies. Predictions at all levels of confidence are evaluated by examining genes not used in training, and top predictions are examined manually using available literature and knowledge base resources.

Conclusion: We assigned a confidence score to each gene/term combination. The results provided high prediction performance, with nearly every GO term achieving greater than 40% precision at 1% recall. Among the 36 novel predictions for GO terms and 40 for phenotypes that were studied manually, >80% and >40%, respectively, were identified as accurate. We also illustrate that a combination of 'guilt-by-profiling' and 'guilt-by-association' outperforms either approach alone in their application to M. musculus.
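The "precision at 1% recall" measure quoted above can be illustrated with a minimal sketch. It assumes that, for one term, each held-out gene has a confidence score and a binary label saying whether the gene is truly annotated with that term. The function name `precision_at_recall` and the toy data are hypothetical and are not taken from the paper; ties between scores are simply left in input order.

```python
def precision_at_recall(scores, labels, recall_level=0.01):
    """Precision at the first rank cutoff whose recall reaches recall_level.

    scores: confidence scores for one term, one per gene (higher = more confident).
    labels: 1 if the gene is truly annotated with the term, else 0.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Rank genes from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    for cutoff, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positives
        if recall >= recall_level:
            # Fraction of the top `cutoff` predictions that are correct.
            return true_positives / cutoff

    # Only reachable if recall_level > 1; report precision over the full list.
    return true_positives / len(ranked)


# Toy example (illustrative values only): five genes, two true annotations.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 0]
print(precision_at_recall(toy_scores, toy_labels, recall_level=0.5))
```

With many genes per term, 1% recall is reached after only the first few true annotations are recovered, so the measure reflects how reliable the very top of the ranked list is.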
