OSIRISv1.2: A named entity recognition system for sequence variants of genes in biomedical literature

Background
Single Nucleotide Polymorphisms, among other types of sequence variants, constitute key elements in genetic epidemiology and pharmacogenomics. While sequence data about genetic variation are available in databases such as dbSNP, clues about the functional and phenotypic consequences of the variations are generally found in the biomedical literature. The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of a widely accepted standard notation for biomedical entities. Thus, automatic systems for the identification of citations of allelic variants of genes in biomedical texts are required.

Results
Our group has previously reported the development of OSIRIS, a system aimed at the retrieval of literature about allelic variants of genes (http://ibi.imim.es/osirisform.html). Here we describe the development of a new version of OSIRIS (OSIRISv1.2, http://ibi.imim.es/OSIRISv1.2.html), which incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB, a database that collects data on human gene sequence variations. The new entity recognition module is based on a pattern-based search algorithm that identifies variation terms in the texts and maps them to dbSNP identifiers. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision, 82% recall, and an F-score of 0.89. As an example, we present the application of the system to collecting literature citations for the allelic variants of genes related to two diseases, intracranial aneurysm and breast cancer.

Conclusion
OSIRISv1.2 can be used to link literature references to dbSNP database entries with high accuracy, and is therefore suitable for collecting current knowledge on gene sequence variations and supporting the functional annotation of variation databases. The application of OSIRISv1.2 in combination with controlled vocabularies such as MeSH provides a way to identify associations of biomedical interest, such as those that relate SNPs to diseases.
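
To make the idea of a pattern-based search for variation terms more concrete, the following Python sketch may help. It is not the OSIRISv1.2 implementation: the regular expressions, the names `VARIANT_PATTERNS`, `find_variants`, and `f_score`, and the example sentence are all assumptions introduced here for illustration, and the mapping of recognized terms to dbSNP identifiers is not covered. The sketch also shows how a balanced F-score is usually derived from precision and recall, assuming the reported F-score is the harmonic mean of the two.

```python
import re

# Three-letter amino acid codes, used to build the protein-level pattern.
AMINO_ACIDS = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)

# Hypothetical patterns for illustration only; the actual pattern set
# used by OSIRISv1.2 is not described in this abstract.
VARIANT_PATTERNS = {
    # dbSNP reference identifiers, e.g. "rs2070744"
    "dbsnp_id": re.compile(r"\brs\d+\b"),
    # Nucleotide substitutions, e.g. "T-786C" or "G894T"
    "nucleotide_substitution": re.compile(r"\b[ACGT][-+]?\d+[ACGT]\b"),
    # Amino acid substitutions, e.g. "Glu298Asp"
    "amino_acid_substitution": re.compile(
        rf"\b(?:{AMINO_ACIDS})\d+(?:{AMINO_ACIDS})\b"
    ),
}


def find_variants(text):
    """Return (kind, mention, start_offset) tuples in order of appearance."""
    hits = []
    for kind, pattern in VARIANT_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example sentence, not taken from the evaluation corpus.
    sentence = "The T-786C polymorphism (rs2070744) and Glu298Asp were genotyped."
    for kind, mention, start in find_variants(sentence):
        print(f"{start:3d}  {kind:26s} {mention}")

    # Precision and recall as reported in the abstract (rounded values).
    print(round(f_score(0.99, 0.82), 3))
```

Applied to the rounded figures reported above (0.99 and 0.82), this formula gives approximately 0.897; the published value of 0.89 was presumably computed from the unrounded precision and recall. A pattern set this small would produce both false positives and false negatives on real abstracts, which is why evaluation against a manually annotated corpus matters.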
