Improving the Inter-Corpora Compatibility for Protein Annotations

Although several corpora with protein annotation exist, incompatibility between the annotations in different corpora remains a problem that hinders progress in the automatic recognition of protein names in biomedical literature. Here, we report on our efforts to find a solution to the incompatibility issue and to improve the compatibility between two representative protein-annotated corpora: the GENIA corpus and the GENETAG corpus. A comparative study improves our insight into the two corpora, and a series of experimental results shows that most of the incompatibility can be removed.
