variant2literature: full text literature search for genetic variants

Motivation: Whole genome sequencing (WGS) by next-generation sequencing identifies millions of variants in a single individual. Retrieving the biomedical literature for such a large number of genetic variants remains challenging, because in many cases the variants appear only in tables rendered as images, or in supplementary documents that come in diverse file formats.

Results: The proposed tool, variant2literature, from TaiGenomics (Toolkits for AI genomics) addresses this problem by combining text recognition with image processing. In addition to adopting advanced text retrieval, it further improves the recall of literature containing the variants of interest by applying variant normalization: different presentations of the same variant are transformed into chromosome coordinates (standard VCF format), so that false negatives can be largely avoided. variant2literature is available in two ways. First, a web-based interface searches all the literature in the PMC Open Access Subset. Second, a command-line executable can be downloaded so that users can search all the files in a specified local directory.

Availability: http://variant2literature.taigenomics.com/

Contact: chienyuchen@ntu.edu.tw
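The variant normalization step described above can be illustrated with a minimal sketch. It is not the implementation used by variant2literature: the patterns cover only single-nucleotide substitutions, and the gene name `GENE1`, the table `TOY_TRANSCRIPTS`, and the helper names are hypothetical. The coordinate mapping assumes a toy single-exon, forward-strand transcript, whereas a real tool would need transcript annotation. The sketch only shows how several written forms of one variant can collapse to a single VCF-style key (chromosome, position, reference allele, alternate allele).

```python
import re
from typing import NamedTuple, Optional, Tuple


class Substitution(NamedTuple):
    pos: int  # 1-based position in the coding sequence
    ref: str  # reference base
    alt: str  # alternate base


# A few ways a coding-sequence substitution may be written in text.
_PATTERNS = [
    # c.76A>G, 76A>G, c.76A->G, c.76A→G
    re.compile(r"^(?:c\.)?(?P<pos>\d+)\s*(?P<ref>[ACGT])\s*(?:>|->|→)\s*(?P<alt>[ACGT])$", re.I),
    # A76G, c.A76G
    re.compile(r"^(?:c\.)?(?P<ref>[ACGT])(?P<pos>\d+)(?P<alt>[ACGT])$", re.I),
]

# Hypothetical annotation: gene -> (chromosome, genomic position of coding base 1).
# Assumes a single-exon transcript on the forward strand.
TOY_TRANSCRIPTS = {"GENE1": ("chr1", 1000)}


def parse_substitution(text: str) -> Optional[Substitution]:
    """Parse one written form of a substitution, or return None if unrecognized."""
    cleaned = text.strip()
    for pattern in _PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return Substitution(int(match["pos"]), match["ref"].upper(), match["alt"].upper())
    return None


def to_vcf_key(gene: str, text: str) -> Optional[Tuple[str, int, str, str]]:
    """Map a gene plus a written variant to a (chrom, pos, ref, alt) key."""
    sub = parse_substitution(text)
    if sub is None or gene not in TOY_TRANSCRIPTS:
        return None
    chrom, cds_start = TOY_TRANSCRIPTS[gene]
    return (chrom, cds_start + sub.pos - 1, sub.ref, sub.alt)


if __name__ == "__main__":
    for form in ["c.76A>G", "76A>G", "A76G", "c.76A→G"]:
        print(form, to_vcf_key("GENE1", form))
```

Because every form resolves to the same key, a search indexed on that key finds a paper regardless of which notation its authors used. This is the sense in which normalization reduces false negatives.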
