One of the foundational text-mining tasks in the biomedical domain is the identification of gene and protein names in journal papers. However, because gene names are ambiguous, the performance of information management tasks such as query-based retrieval will suffer unless gene name mentions are explicitly mapped back to a unique identifier. This mapping resolves issues relating to synonymy (i.e. many different lexical forms representing the same gene) and ambiguity (i.e. many distinct genes sharing the same lexical form). The task is called gene name normalisation, and was recently investigated at the BioCreative Challenge (Hirschman et al., 2004b), a text-mining evaluation forum focusing on core biomedical text processing tasks. In this work, we present a machine learning approach to gene normalisation based on work by Crim et al. (2005). We compare this system with a number of simple dictionary lookup-based methods, and we investigate a number of novel features not used by Crim et al. (2005). Our results show that it is difficult to improve upon the original set of features used by Crim et al. (2005). We also show that, for some organisms, gene name normalisation can be performed successfully using simple dictionary lookup techniques.
[1] Alexander A. Morgan, et al. Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 2005.
[2] Alfonso Valencia, et al. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 2005.
[3] Fernando Pereira, et al. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics, 2005.
[4] Fernando Pereira, et al. Automatically annotating documents with normalized gene lists. BMC Bioinformatics, 2005.
[5] Tatiana A. Tatusova, et al. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res., 2004.