A semantic analysis of the annotations of the human genome

The correct interpretation of any biological experiment depends critically on the accuracy and consistency of existing annotation databases. Such databases are ubiquitous and are used by life scientists in most experiments. However, they are known to be incomplete, and many of their annotations may be incorrect. In this paper we describe a technique for analyzing the semantic content of such annotation databases. Our approach extracts implicit semantic relationships between genes and functions, which allows us to discover novel functions for known genes. It can also identify missing and inaccurate annotations in existing annotation databases and thus help improve their accuracy. We used our technique to analyze the current annotations of the human genome. From this body of annotations, we predicted 212 additional gene-function assignments. A subsequent literature search found that 138 of these gene-function assignments are supported by existing peer-reviewed papers. An additional 23 assignments have since been confirmed by the addition of the respective annotations in later releases of the Gene Ontology database. Overall, the 161 confirmed assignments represent 75.94% of the proposed gene-function assignments. Only one of our predictions (0.47%) was contradicted by the existing literature. We could not find any relevant articles for 50 of our predictions (23.58%). The method is independent of the organism and can be used to analyze and improve the quality of the data in any public or private annotation database.
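The validation tally reported above can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the counts are taken from the abstract, while the variable names and the rounding to two decimals are assumptions made for illustration.

```python
# Counts as stated in the abstract; variable names are illustrative only.
total_predictions = 212
supported_by_literature = 138
confirmed_by_later_go_releases = 23
contradicted = 1
no_relevant_articles = 50

# Confirmed assignments are those supported by either source of evidence.
confirmed = supported_by_literature + confirmed_by_later_go_releases

# The three outcome groups should account for every prediction.
assert confirmed + contradicted + no_relevant_articles == total_predictions


def share(count: int) -> float:
    """Percentage of all predictions, rounded to two decimals (an assumption)."""
    return round(100 * count / total_predictions, 2)


shares = {
    "confirmed": share(confirmed),
    "contradicted": share(contradicted),
    "no relevant articles": share(no_relevant_articles),
}

for outcome, percent in shares.items():
    print(f"{outcome}: {percent}%")
```

Recomputing the shares from the raw counts keeps the three outcome groups consistent with each other and with the total of 212 predictions.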
