Mutation extraction tools can be combined for robust recognition of genetic variants in the literature

As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, individually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any individual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.

[1]  Karin M. Verspoor,et al.  Detection of Protein Catalytic Sites in the Biomedical Literature , 2013, Pacific Symposium on Biocomputing.

[2]  Zhiyong Lu,et al.  tmVar: a text mining approach for extracting sequence variants in biomedical literature , 2013, Bioinform..

[3]  Sherri de Coronado,et al.  NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information , 2007, J. Biomed. Informatics.

[4]  Karin M. Verspoor,et al.  BioC: a minimalist approach to interoperability for biomedical text processing , 2013, AMIA.

[5]  K. Bretonnel Cohen,et al.  The textual characteristics of traditional and Open Access scientific journals are similar , 2008, BMC Bioinformatics.

[6]  A. Yepes Towards automatic large-scale curation of genomic variation: improving coverage based on supplementary material , 2013 .

[7]  R. Durbin,et al.  The Sequence Ontology: a tool for the unification of genome annotations , 2005, Genome Biology.

[8]  André Blavier,et al.  A formalized description of the standard human variant nomenclature in Extended Backus-Naur Form , 2011, BMC Bioinformatics.

[9]  Alexander V. Diemand,et al.  The Swiss‐Prot variant page and the ModSNP database: A resource for sequence and structure information on human protein variants , 2004, Human mutation.

[10]  K. Bretonnel Cohen,et al.  Intrinsic Evaluation of Text Mining Tools May Not Predict Performance on Realistic Tasks , 2007, Pacific Symposium on Biocomputing.

[11]  Elizabeth M. Smigielski,et al.  dbSNP: the NCBI database of genetic variation , 2001, Nucleic Acids Res..

[12]  Olivier Bodenreider,et al.  Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature , 2011, Bioinform..

[13]  Laura Inés Furlong,et al.  Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers , 2011, BMC Bioinformatics.

[14]  Olivier Bodenreider,et al.  A mutation-centric approach to identifying pharmacogenomic relations in text , 2012, J. Biomed. Informatics.

[15]  Chitta Baral,et al.  A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions , 2012, J. Biomed. Informatics.

[16]  M. Vihinen,et al.  KinMutBase: A registry of disease‐causing mutations in protein kinase domains , 2005, Human mutation.

[17]  Karin M. Verspoor,et al.  Literature mining of genetic variants for curation: quantifying the importance of supplementary material , 2014, Database J. Biol. Databases Curation.

[18]  M. Woods,et al.  The InSiGHT database: utilizing 100 years of insights into Lynch Syndrome , 2013, Familial Cancer.

[19]  Fan Meng,et al.  Medline search engine for finding genetic markers with biological significance , 2007, Bioinform..

[20]  Laura Inés Furlong,et al.  OSIRISv1.2: A named entity recognition system for sequence variants of genes in biomedical literature , 2008, BMC Bioinformatics.

[21]  René Witte,et al.  Mutation Mining—A Prospector's Tale , 2006, Inf. Syst. Frontiers.

[22]  M. Schenck,et al.  Extraction of Genetic Mutations Associated with Cancer from Public Literature , 2012 .

[23]  Andrew C. R. Martin,et al.  Human Mutation , 2020 .

[24]  K. Bretonnel Cohen,et al.  MutationFinder: a high-performance system for extracting point mutation mentions from text , 2007, Bioinform..

[25]  Karin M. Verspoor,et al.  Literature mining of protein-residue associations with graph rules learned through distant supervision , 2012, J. Biomed. Semant..

[26]  Ourania Horaitis,et al.  Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases. , 2002, Genome research.

[27]  Alan F. Scott,et al.  Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders , 2002, Nucleic Acids Res..

[28]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[29]  Nona Naderi,et al.  Automated extraction and semantic analysis of mutation impacts from the biomedical literature , 2012, BMC Genomics.

[30]  Dietrich Rebholz-Schuhmann,et al.  Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb , 2009, BMC Bioinformatics.

[31]  Antonio Jimeno-Yepes,et al.  GeneRIF indexing: sentence selection based on machine learning , 2013, BMC Bioinformatics.

[32]  C. Cole,et al.  Mining cancer genomes in COSMIC , 2012, BMC Proceedings.

[33]  M. Stratton,et al.  The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website , 2004, British Journal of Cancer.

[34]  S. Antonarakis,et al.  Mutation nomenclature extensions and suggestions to describe complex mutations: A discussion , 2000 .

[35]  Alfonso Valencia,et al.  Extraction of human kinase mutations from literature, databases and genotyping studies , 2009, BMC Bioinformatics.

[36]  Karin M. Verspoor,et al.  Annotating the biomedical literature for the human variome , 2013, Database J. Biol. Databases Curation.