Normalization of Gene/Protein Names in Biological Literatures using Vector-Space Model

As the volume of biological literature grows exponentially, the need for text mining systems increases. In text mining, normalization maps gene/protein names mentioned in text to identifiers in a database. It is necessary for combining information extracted from different articles and for curating a database or an ontology from the literature. Previous normalization research relied on direct comparison between database entries and the text, an approach that is weak against the highly variable gene/protein names found in the literature. In this paper, we therefore propose a normalization method based on the vector-space model. For each gene/protein name, we rank candidate identifiers using the vector-space model and select the identifier most similar to the name. Experimental results show that the proposed method achieves an F-measure of 70.7%.
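The ranking step can be illustrated with a minimal sketch. The abstract does not specify the term weighting, tokenization, or database used, so everything below is an assumption made for illustration: a smoothed TF-IDF weighting over lowercase letter and digit tokens, cosine similarity as the similarity measure, and a toy dictionary whose identifiers (`ID:0001` and so on) and function names (`tokenize`, `build_index`, `rank_identifiers`) are hypothetical.

```python
import math
import re
from collections import Counter

# Toy synonym dictionary; the identifiers are hypothetical placeholders,
# not entries from any real database.
TOY_DICTIONARY = {
    "ID:0001": ["tumor protein p53", "TP53", "p53"],
    "ID:0002": ["tumor necrosis factor", "TNF", "TNF-alpha"],
    "ID:0003": ["interleukin 2", "IL-2", "IL2"],
}

def tokenize(name):
    # Assumed tokenization: lowercase, then split into runs of letters or digits,
    # so that "IL-2", "IL2" and "il 2" all yield ["il", "2"].
    return re.findall(r"[a-z]+|[0-9]+", name.lower())

def build_index(dictionary):
    # Treat all synonyms of one identifier as a single "document".
    term_counts = {
        ident: Counter(tok for syn in synonyms for tok in tokenize(syn))
        for ident, synonyms in dictionary.items()
    }
    n_docs = len(term_counts)
    # Document frequency: number of identifiers whose synonyms contain the token.
    doc_freq = Counter(tok for counts in term_counts.values() for tok in counts)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = {
        ident: {tok: count * idf[tok] for tok, count in counts.items()}
        for ident, counts in term_counts.items()
    }
    return vectors, idf

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_identifiers(mention, vectors, idf):
    # Build the query vector from the mention, ignoring tokens unseen in the dictionary.
    counts = Counter(tokenize(mention))
    query = {tok: count * idf[tok] for tok, count in counts.items() if tok in idf}
    scored = [(cosine(query, vec), ident) for ident, vec in vectors.items()]
    # Highest similarity first; ties are broken by identifier for determinism.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

if __name__ == "__main__":
    vectors, idf = build_index(TOY_DICTIONARY)
    for mention in ["p-53 tumor protein", "interleukin-2"]:
        score, ident = rank_identifiers(mention, vectors, idf)[0]
        print(mention, "->", ident, round(score, 3))
```

In this sketch a mention such as "p-53 tumor protein" still ranks the first toy identifier highest although it matches none of the listed synonyms exactly, which is the kind of surface variation that direct string comparison handles poorly. A practical system would also need a similarity threshold below which no identifier is assigned; that choice is outside what the abstract describes.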
