Scaling up data curation using deep learning: An application to literature triage in genomic variation resources

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. Retrieving relevant publications for curation, a process known as document triage, is usually carried out by querying PubMed and reading the returned articles. However, this query-based method often yields unsatisfactory precision and recall, and optimal queries are difficult to construct by hand. To address this, we propose a machine learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation processes of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient, and it is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
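To make the evaluation terms concrete, the following minimal sketch shows how precision and recall might be computed for a query-based retrieval and for a score-based triage that ranks documents by classifier output. It is an illustration only: the document identifiers, scores, and the 0.5 threshold are made-up values, the function names are hypothetical, and the scores stand in for the output of a trained model rather than reproducing the convolutional neural networks used in the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a retrieved set against a gold-standard set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def triage(scores, threshold=0.5):
    """Rank documents by descending score and keep those at or above the threshold.

    `scores` maps a document id to a relevance score in [0, 1]; here the scores
    are hypothetical placeholders for a classifier's output.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score >= threshold]


# Hypothetical gold standard: documents a curator judged relevant.
relevant = {"doc1", "doc2", "doc3"}

# Hypothetical query-based retrieval: finds everything relevant, plus noise.
query_hits = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]

# Hypothetical classifier scores for the same candidate documents.
model_scores = {
    "doc1": 0.95, "doc2": 0.80, "doc3": 0.70,
    "doc4": 0.40, "doc5": 0.10, "doc6": 0.05,
}
model_hits = triage(model_scores)

query_precision, query_recall = precision_recall(query_hits, relevant)
model_precision, model_recall = precision_recall(model_hits, relevant)

print("query-based:", query_precision, query_recall)
print("model-based:", model_precision, model_recall)
print("precision ratio:", model_precision / query_precision)
```

In this toy setting both methods retrieve every relevant document, so recall is unchanged, while the score-based triage discards the low-scoring candidates and therefore has higher precision. A "times higher" figure for precision, as reported above, is a ratio of this kind.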
