Annotating Genes Using Textual Patterns

Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant amounts of human effort, is very costly, and cannot catch up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system to automatically infer new GO annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) based on the extracted terms, constructs textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through "pattern crosswalks", (iv) employs semantic pattern matching, rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (iv) annotates genes based on the "quality" of the matched pattern to the genomic entity occurring in the text. On the average, in our experiments, GEANN has reached to the precision level of 78% at the 57% recall level.

[1]  Burr Settles,et al.  ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[2]  Eduard H. Hovy,et al.  Fine Grained Classification of Named Entities , 2002, COLING.

[3]  Toshihisa Takagi,et al.  Data and text mining Automatic extraction of gene / protein biological functions from biomedical text , 2005 .

[4]  K. Bretonnel Cohen,et al.  The Compositional Structure of Gene Ontology Terms , 2003, Pacific Symposium on Biocomputing.

[5]  N. Sonenberg,et al.  A new translational regulator with homology to eukaryotic translation initiation factor 4G , 1997, The EMBO journal.

[6]  Eisaku Maeda,et al.  Assigning gene ontology categories (GO) to yeast genes using text-based supervised learning methods , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[7]  Sergey Brin,et al.  Extracting Patterns and Relations from the World Wide Web , 1998, WebDB.

[8]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[9]  Gideon S. Mann Fine-Grained Proper Noun Ontologies for Question Answering , 2002, COLING 2002.

[10]  Eisaku Maeda,et al.  Assigning gene ontology categories (GO) to yeast genes using text-based supervised learning methods , 2004 .

[11]  Ellen Riloff,et al.  Automatically Generating Extraction Patterns from Untagged Text , 1996, AAAI/IAAI, Vol. 2.

[12]  Luis Gravano,et al.  Snowball: extracting relations from large plain-text collections , 2000, DL '00.

[13]  Gene Ontology Consortium The Gene Ontology (GO) database and informatics resource , 2003 .

[14]  B Marshall,et al.  Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource , 2004, Nucleic Acids Res..

[15]  C. Fellbaum An Electronic Lexical Database , 1998 .

[16]  Jeffrey T. Chang,et al.  Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. , 2002, Genome research.

[17]  Doug Downey,et al.  Web-scale information extraction in knowitall: (preliminary results) , 2004, WWW '04.

[18]  Gerald Salton,et al.  Automatic text processing , 1988 .

[19]  Alfonso Valencia,et al.  Evaluation of BioCreAtIvE assessment of task 2 , 2005, BMC Bioinformatics.