A text-mining analysis of the human phenome

A number of large-scale efforts are underway to define the relationships between genes and proteins in various species. However, few attempts have been made to systematically classify all such relationships at the phenotype level, and it is unknown whether such a phenotype map would carry biologically meaningful information. We have used text mining to classify over 5000 human phenotypes contained in the Online Mendelian Inheritance in Man (OMIM) database. We find that similarity between phenotypes reflects biological modules of interacting, functionally related genes. These similarities are positively correlated with several measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein–protein interaction. Phenotype grouping thus reflects the modular nature of human disease genetics, and phenotype mapping may be used to predict candidate genes for diseases as well as functional relations between genes and proteins. Such predictions should improve further if a unified system of phenotype descriptors is developed. The phenotype similarity data are accessible through a web interface at http://www.cmbi.ru.nl/MimMiner/.
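To make the idea of text-derived phenotype similarity concrete, the following is a minimal sketch, not the method used in this work. It assumes that each phenotype record is reduced to a bag-of-words term vector and that pairs are compared by cosine similarity; the abstract does not specify the similarity measure, so that choice is an assumption. The records, the stopword list, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy records; real OMIM entries are much longer free-text descriptions.
RECORDS = {
    "phenotype_A": "polycystic kidney disease with renal cysts and hypertension",
    "phenotype_B": "renal cysts and kidney failure with hypertension",
    "phenotype_C": "hearing loss with retinal degeneration",
}

# Assumed minimal stopword list, for illustration only.
STOPWORDS = {"with", "and", "of", "the"}


def term_vector(text):
    """Count the non-stopword terms in a free-text phenotype description."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_table(records):
    """Score every unordered pair of records, keyed by sorted name pair."""
    vectors = {name: term_vector(text) for name, text in records.items()}
    names = sorted(vectors)
    return {
        (p, q): cosine_similarity(vectors[p], vectors[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


if __name__ == "__main__":
    for pair, score in similarity_table(RECORDS).items():
        print(pair, round(score, 3))
```

In this toy data the two kidney-related records share four terms and score well above the unrelated pair, which shares none. A realistic pipeline would likely need a controlled vocabulary and term weighting, which this sketch omits.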
