Data and text mining

nala: text mining natural language mutation mentions

Motivation: The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant mentions in natural language (NL) largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’).

Results: We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST mentions by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of the mentions. For NL mentions, the corresponding value rose to 100% nala-only.

Availability and Implementation: Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+.

Contact: nala@rostlab.org

Supplementary information: Supplementary data are available at Bioinformatics online.
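To make the ST/NL distinction concrete, the following minimal sketch applies a simple pattern to the two example mentions quoted in the abstract. The regular expression and the function name `find_st_mentions` are illustrative assumptions, not part of nala, SETH, or tmVar; the sketch only shows why a surface pattern tuned to ST notation does not fire on the NL phrasing.

```python
import re
from typing import List

# Assumption: a deliberately simple pattern for ST protein substitutions
# (one-letter amino acid, position, one-letter amino acid), e.g. "E6V".
# This is NOT the pattern set used by any published tool.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ST_PATTERN = re.compile(rf"\b[{AMINO_ACIDS}]\d+[{AMINO_ACIDS}]\b")


def find_st_mentions(text: str) -> List[str]:
    """Return all substrings that look like ST substitution mentions."""
    return ST_PATTERN.findall(text)


st_sentence = "The E6V substitution was observed."
nl_sentence = "glutamic acid was substituted by valine at residue 6"

print(find_st_mentions(st_sentence))  # the ST mention is matched
print(find_st_mentions(nl_sentence))  # the NL mention yields no match
```

The NL sentence describes the same substitution, yet contains no token of the form letter-number-letter, which is the kind of gap a sequence-labelling approach such as the one described above is meant to close.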
