Automatic Human-like Mining and Constructing Reliable Genetic Association Database with Deep Reinforcement Learning

The growing volume of scientific literature in biological and biomedical research makes continuous, reliable curation of newly discovered knowledge a challenge, and automatic biomedical text mining has been one of the answers to this challenge. In this paper, we aim to further improve the reliability of biomedical text mining by training the system to directly simulate human behaviors such as querying PubMed, selecting articles from the query results, and reading the selected articles for knowledge. We combine the efficiency of biomedical text mining, the flexibility of deep reinforcement learning, and the massive amount of knowledge collected in UMLS into an integrative, artificially intelligent reader that can automatically identify authentic articles and effectively acquire the knowledge they convey. We construct a system whose current primary task is to build a genetic association database between genes and complex human traits. Our contributions in this paper are three-fold: 1) We propose to improve the reliability of text mining by building a system that directly simulates the behavior of a researcher, and we develop corresponding methods, such as a bi-directional LSTM for text mining and a Deep Q-Network for organizing behaviors. 2) We demonstrate the effectiveness of our system with an example of constructing a genetic association database. 3) We release our implementation as a generic framework that researchers in the community can conveniently use to construct other databases.
