Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine

Abstract The Precision Medicine Initiative is a multicenter effort that aims to formulate personalized treatments by leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs: it reports the genetic and molecular interactions that provide the scaffold for cellular regulatory systems and details the influence of genetic variants on these interactions. The BioCreative VI Precision Medicine Track aimed to extract this particular type of information and was organized in two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein–protein interactions (PPIs) affected by genetic mutations, and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams contributed 14 different text-mining systems for the relation extraction task. When the text-mining system predictions were compared with human annotations for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when homologous genes were taken into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods.
Given the level of participation and the individual team results, we find the precision medicine track to be successful in engaging the text-mining research community. The track also produced a manually annotated corpus of 5509 PubMed documents, developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPIs affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and on incorporating the practical benefits of text-mining tools into real-world curation of precision medicine information.
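
The abstract reports precision, recall, F-score and average precision for the triage task. As a rough illustration of how such measures are commonly computed, the following Python sketch scores a set of predicted relevant documents and a ranked list against a gold standard. It is a minimal sketch using the standard textbook definitions, not the track's official evaluation script, which may differ in detail; the function names and document identifiers are hypothetical. Note that each "best" figure in the abstract is reported independently, so the best precision and best recall need not come from the same run as the best F-score.

```python
from typing import Iterable, Set, Tuple


def precision_recall_f1(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and balanced F-score (F1)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(ranked: Iterable[str], gold: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant document appears,
    normalized by the total number of relevant (gold) documents."""
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold)


# Hypothetical document identifiers, for illustration only.
gold_docs = {"doc1", "doc3"}
predicted_docs = {"doc1", "doc2", "doc3"}
ranked_docs = ["doc1", "doc2", "doc3", "doc4"]

p, r, f = precision_recall_f1(gold_docs, predicted_docs)
ap = average_precision(ranked_docs, gold_docs)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f} average_precision={ap:.4f}")
```

Precision, recall and F-score depend only on the binary relevant/not-relevant decision, whereas average precision also rewards systems that rank relevant articles near the top of the list, which matters when curators read results in order.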
