Word Sense Disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering

With more and more genomes being sequenced, considerable effort is devoted to annotating them with terms from controlled vocabularies such as the Gene Ontology. Manual annotation based on the relevant literature is tedious, but automating the process is difficult. One particularly challenging problem is word sense disambiguation: a term such as 'development' can refer to developmental biology or be used in its more general sense. Here, we present two approaches to this problem, based on term co-occurrence analysis and document clustering. To evaluate them, we compiled a corpus of 331 documents on development and developmental biology. Term co-occurrence analysis achieves an F-measure of 77%, and additionally applying document clustering improves precision to 82%. Applying the same approach to disambiguate 'nucleus', 'transport', and 'spindle' gave consistent results. Thus, our method is a viable step towards automating literature-based genome annotation.
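
To make the idea of co-occurrence-based disambiguation concrete, the following is a minimal sketch and not the implementation evaluated here. It assumes a small set of documents already labelled by sense, builds a profile of the terms that co-occur with the ambiguous term for each sense, and assigns a new text to the sense whose profile it overlaps most. The function names, the stopword list, and the toy documents are all hypothetical; `f_measure` is the standard harmonic mean of precision and recall.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for this illustration.
STOPWORDS = {"the", "of", "in", "a", "and"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def cooccurrence_profile(documents, target):
    """Count terms that share a document with the target term."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        if target in tokens:
            counts.update(tokens - {target})
    return counts


def disambiguate(text, profiles):
    """Return the sense whose profile best overlaps the text (normalized)."""
    tokens = tokenize(text)
    scores = {}
    for sense, profile in profiles.items():
        total = sum(profile.values()) or 1  # avoid division by zero
        scores[sense] = sum(profile[t] for t in tokens) / total
    return max(scores, key=scores.get)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labelled documents (invented for illustration only).
profiles = {
    "developmental biology": cooccurrence_profile(
        ["embryo development requires cell differentiation",
         "limb development in the embryo"],
        "development"),
    "general": cooccurrence_profile(
        ["software development needs testing",
         "economic development of the region"],
        "development"),
}

print(disambiguate("development of the zebrafish embryo", profiles))
```

A realistic system would need a much larger labelled corpus and a more careful weighting of co-occurring terms; this sketch only illustrates the overall shape of the computation.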
