Literature mining of genetic variants for curation: quantifying the importance of supplementary material

A major focus of modern biological research is understanding how genomic variation relates to disease. Although significant efforts are under way to capture this understanding in curated resources, much of the information remains locked in unstructured sources, in particular the scientific literature. Several text mining systems have therefore been developed to extract mutations and other genetic variation from the literature. We have performed the first study of the use of text mining for the recovery of genetic variants curated directly from the literature. We consider two curated databases, COSMIC (Catalogue Of Somatic Mutations In Cancer) and InSiGHT (International Society for Gastro-intestinal Hereditary Tumours), that contain explicit links to the source literature for each included mutation. Our analysis shows that the recall of the mutations catalogued in the databases by a text mining tool is very low, despite the tool's well-established good performance and even when the full text of the associated article is available for processing. We demonstrate that this discrepancy can be explained by the supplementary material linked to the published articles, which text mining tools have not previously considered. Although it is anecdotally known that supplementary material contains ‘all of the information’, and some researchers have speculated about its role (Schenck et al. Extraction of genetic mutations associated with cancer from public literature. J Health Med Inform 2012;S2:2.), our analysis substantiates the significant extent to which this material is critical. Our results highlight the need for literature mining tools to consider not only the narrative content of a publication but also the full set of material related to it.
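
As a rough illustration of the recall measure discussed above, the following minimal sketch compares the variants curated for one article against those a text mining tool might extract, first from the full text alone and then with supplementary material included. The function name `recall` and all variant strings are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
def recall(curated, extracted):
    """Fraction of curated variants that also appear among the extracted ones."""
    curated = set(curated)
    if not curated:
        return 0.0
    return len(curated & set(extracted)) / len(curated)


# Made-up variants for a single article (assumption, not study data).
curated = {"V600E", "G12D", "R175H", "R248Q"}   # variants listed in the database
from_full_text = {"V600E"}                      # found in the narrative text
from_supplement = {"G12D", "R175H", "E545K"}    # found in supplementary files

# Recall when only the article text is mined.
recall_text = recall(curated, from_full_text)

# Recall when text and supplementary material are mined together.
recall_all = recall(curated, from_full_text | from_supplement)

print(recall_text, recall_all)
```

In this toy case, adding the supplementary material raises recall because most curated variants appear only there; an extracted variant that is not in the curated set (here `E545K`) does not affect recall.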

[1] K. Bretonnel Cohen, et al. MutationFinder: a high-performance system for extracting point mutation mentions from text, 2007, Bioinform.

[2] Karin M. Verspoor, et al. Detection of Protein Catalytic Sites in the Biomedical Literature, 2013, Pacific Symposium on Biocomputing.

[3] S. Antonarakis, et al. Corrigendum: Mutation nomenclature extensions and suggestions to describe complex mutations: A discussion, 2002, Human mutation.

[4] Karin M. Verspoor, et al. Annotating the biomedical literature for the human variome, 2013, Database J. Biol. Databases Curation.

[5] Chitta Baral, et al. A SNPshot of PubMed to associate genetic variants with drugs, diseases, and adverse reactions, 2012, J. Biomed. Informatics.

[6] Rob W.W. Hooft, et al. The value of data, 2011, Nature Genetics.

[7] K. Bretonnel Cohen, et al. Intrinsic Evaluation of Text Mining Tools May Not Predict Performance on Realistic Tasks, 2007, Pacific Symposium on Biocomputing.

[8] H T Lynch, et al. Review of the Lynch syndrome: history, molecular genetics, screening, differential diagnosis, and medicolegal ramifications, 2009, Clinical genetics.

[9] Ourania Horaitis, et al. Time for a unified system of mutation description and reporting: a review of locus-specific mutation databases, 2002, Genome research.

[10] Martijn J. Schuemie, et al. Distribution of information in biomedical abstracts and full-text publications, 2004, Bioinform.

[11] K. Bretonnel Cohen, et al. The textual characteristics of traditional and Open Access scientific journals are similar, 2008, BMC Bioinformatics.

[12] M. Vihinen, et al. KinMutBase: A registry of disease‐causing mutations in protein kinase domains, 2005, Human mutation.

[13] K. Bretonnel Cohen, et al. Text mining for the biocuration workflow, 2012, Database J. Biol. Databases Curation.

[14] Alexander V. Diemand, et al. The Swiss‐Prot variant page and the ModSNP database: A resource for sequence and structure information on human protein variants, 2004, Human mutation.

[15] Li Gong, et al. PharmGKB: An Integrated Resource of Pharmacogenomic Data and Knowledge, 2008, Current protocols in bioinformatics.

[16] Alfonso Valencia, et al. Extraction of human kinase mutations from literature, databases and genotyping studies, 2009, BMC Bioinformatics.

[17] K. Bretonnel Cohen, et al. Mining the pharmacogenomics literature - a survey of the state of the art, 2012, Briefings Bioinform.

[18] M. Stratton, et al. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website, 2004, British Journal of Cancer.

[19] Olivier Bodenreider, et al. Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature, 2011, Bioinform.

[20] Fan Meng, et al. Medline search engine for finding genetic markers with biological significance, 2007, Bioinform.

[21] Andrew C. R. Martin, et al. Human Mutation, 2020.

[22] Nona Naderi, et al. Automated extraction and semantic analysis of mutation impacts from the biomedical literature, 2012, BMC Genomics.

[23] Zhiyong Lu, et al. tmVar: a text mining approach for extracting sequence variants in biomedical literature, 2013, Bioinform.

[24] Elizabeth M. Smigielski, et al. dbSNP: the NCBI database of genetic variation, 2001, Nucleic Acids Res.

[25] Karin M. Verspoor, et al. Literature mining of protein-residue associations with graph rules learned through distant supervision, 2012, J. Biomed. Semant.

[26] René Witte, et al. Mutation Mining—A Prospector's Tale, 2006, Inf. Syst. Frontiers.

[27] Olivier Bodenreider, et al. A mutation-centric approach to identifying pharmacogenomic relations in text, 2012, J. Biomed. Informatics.

[28] M. Schenck, et al. Extraction of Genetic Mutations Associated with Cancer from Public Literature, 2012.

[29] K. Bretonnel Cohen, et al. The structural and content aspects of abstracts versus bodies of full text journal articles are different, 2010, BMC Bioinformatics.

[30] Alfonso Valencia, et al. wKinMut: An integrated tool for the analysis and interpretation of mutations in human protein kinases, 2013, BMC Bioinformatics.

[31] David Martínez, et al. Extraction of Named Entities from Tables in Gene Mutation Literature, 2009, BioNLP@HLT-NAACL.

[32] Alan F. Scott, et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders, 2002, Nucleic Acids Res.

[33] Dietrich Rebholz-Schuhmann, et al. Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb, 2009, BMC Bioinformatics.