Semantic Annotation of Biomedical Literature Using Google

With the increasing amount of biomedical literature, there is a need for automatic extraction of information to support biomedical researchers. Because biomedical information databases are incomplete, dictionary-based extraction is not straightforward, and several approaches using contextual rules and machine learning have been proposed. Our work is inspired by these approaches but is novel in that it uses Google for semantic annotation of biomedical words. The semantic annotation accuracy obtained, 52% on words not found in the Brown Corpus, Swiss-Prot or LocusLink (accessed using Gsearch.org), justifies further work in this direction.
