LitGen: Genetic Literature Recommendation Guided by Human Explanations

As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetic data. A major rate-limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. Curation is particularly time-consuming because the curator must identify papers that study variant pathogenicity using different approaches and types of evidence, e.g. biochemical assays or case-control analysis. In collaboration with the Clinical Genome Resource (ClinGen), the flagship NIH program for clinical curation, we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by the specific evidence types curators use to assess pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evidence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain a 7.9% to 12.6% relative performance improvement over models trained only on the annotated papers. LitGen offers a useful framework for improving clinical variant curation.
