Reusing label functions to extract multiple types of biomedical relationships from biomedical abstracts at scale

Knowledge bases support multiple research efforts such as providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. Some knowledge bases are automatically constructed, but most are populated via some form of manual curation. Manual curation is time-consuming and difficult to scale in the context of an increasing publication rate. A recently described “data programming” paradigm seeks to circumvent this arduous process by combining distant supervision with simple rules and heuristics written as labeling functions that can be automatically applied to inputs. Unfortunately, writing useful label functions requires substantial error analysis and is a nontrivial task: in early efforts to use data programming we found that producing each label function could take a few days. Producing a biomedical knowledge base with multiple node and edge types could take hundreds or possibly thousands of label functions. In this paper we sought to evaluate the extent to which label functions could be reused across edge types. We used a subset of Hetionet v1 that centered on disease, compound, and gene nodes to evaluate this approach. We compared a baseline distant supervision model with the same distant supervision resources added to edge-type-specific label functions, edge-type-mismatch label functions, and all label functions. We confirmed that adding edge-type-specific label functions improves performance. We also found that adding one or a few edge-type-mismatch label functions nearly always improved performance. Adding a large number of edge-type-mismatch label functions produces variable performance that depends on the edge type being predicted and the label function’s edge type source. Lastly, we show that this approach, even on this subgraph of Hetionet, could add new edges to Hetionet v1 with high confidence.
We expect that practical use of this strategy would include additional filtering and scoring methods, which would further enhance precision.
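
To make the idea of a label function concrete, the following is a minimal Python sketch and not the implementation used in this work. It assumes a candidate is a sentence plus a disease and gene mention. The pair set `KNOWN_DISEASE_GENE_PAIRS`, the keyword patterns, and the function names are all hypothetical. Votes are combined here with a simple majority vote as a stand-in; data programming systems generally learn how much to trust each label function instead.

```python
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

# Hypothetical stand-in for a curated resource used for distant supervision.
KNOWN_DISEASE_GENE_PAIRS = {("breast cancer", "BRCA1")}


def lf_distant_supervision(candidate):
    """Vote positive if the entity pair already appears in the curated resource."""
    pair = (candidate["disease"], candidate["gene"])
    return POSITIVE if pair in KNOWN_DISEASE_GENE_PAIRS else ABSTAIN


def lf_association_keyword(candidate):
    """Vote positive if the sentence contains an (assumed) association phrase."""
    text = candidate["sentence"].lower()
    return POSITIVE if "associated with" in text else ABSTAIN


def lf_negation(candidate):
    """Vote negative if the sentence contains an (assumed) negation phrase."""
    text = candidate["sentence"].lower()
    if "no association" in text or "not associated" in text:
        return NEGATIVE
    return ABSTAIN


def majority_vote(candidate, label_functions):
    """Combine label function outputs; abstain on ties or when all abstain."""
    votes = [lf(candidate) for lf in label_functions]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if counts[POSITIVE] == counts[NEGATIVE]:
        return ABSTAIN
    return POSITIVE if counts[POSITIVE] > counts[NEGATIVE] else NEGATIVE


LABEL_FUNCTIONS = [lf_distant_supervision, lf_association_keyword, lf_negation]

supported = {
    "sentence": "BRCA1 mutations are associated with breast cancer.",
    "disease": "breast cancer",
    "gene": "BRCA1",
}
negated = {
    "sentence": "We found no association between TP53 and asthma.",
    "disease": "asthma",
    "gene": "TP53",
}
unrelated = {
    "sentence": "TP53 and asthma were both mentioned in the survey.",
    "disease": "asthma",
    "gene": "TP53",
}

for cand in (supported, negated, unrelated):
    print(cand["gene"], cand["disease"], majority_vote(cand, LABEL_FUNCTIONS))
```

In this sketch the keyword-based functions inspect only the sentence text, so they could in principle be applied unchanged to candidates of another edge type (for example, compound and disease mentions), which is the kind of reuse the abstract evaluates. The distant supervision function, by contrast, is tied to a resource for one specific edge type.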
