Using deep learning to identify translational research in genomic medicine beyond bench to bedside

Abstract Tracking scientific research publications on the evaluation, utility and implementation of genomic applications is critical for translating basic research into clinical and population health impact. In this work, we use state-of-the-art machine learning approaches to identify translational research in genomics beyond bench to bedside from the biomedical literature. We apply convolutional neural networks (CNNs) and support vector machines (SVMs) to bench/bedside article classification, using the weekly manual annotation data of the Public Health Genomics Knowledge Base. Both classifiers use salient features to estimate the probability that a publication is eligible for curation, which can reduce the workload of manual triage and curation. On an independent test set (n = 400), the CNNs and SVMs achieved F-measures of 0.80 and 0.74, respectively. We further tested the CNNs, which performed better, in the routine annotation pipeline for 2 weeks; they significantly reduced curation effort and retrieved more appropriate research articles. Our approaches provide direct insight into the automated curation of genomic translational research beyond bench to bedside. The machine learning classifiers help annotators curate manually with greater efficiency.
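
The abstract reports F-measures without defining the metric. As a minimal sketch, assuming the balanced F-measure (F1, the harmonic mean of precision and recall) is meant, the score could be computed from classification counts as follows. The function name and the counts are hypothetical illustrations, not the paper's data.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true-positive, false-positive
    and false-negative counts. Returns 0.0 when undefined."""
    # Precision: fraction of articles flagged as eligible that truly are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of truly eligible articles that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not taken from the paper).
score = f_measure(tp=80, fp=20, fn=20)
print(f"F-measure: {score:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a classifier cannot score well on this metric by favoring precision at the expense of recall, or the reverse.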
