Biomedical Corpus Filtering: A Weak Supervision Paradigm With Infused Domain Expertise

Querying biomedical documents from large databases such as PubMed is traditionally keyword-based and usually results in large volumes of documents that lack specificity. A common bottleneck for further filtering with natural language processing (NLP) techniques is the need for a large amount of labeled data to train a machine learning model. To overcome this limitation, we are constructing an NLP pipeline that automatically labels relevant published abstracts, without fitting to any hand-labeled training data, with the goal of identifying the most promising non-cancer generic drugs to repurpose for the treatment of cancer. This work aims to programmatically filter a large set of research articles as either relevant or non-relevant, where relevance is defined as studies that have evaluated the efficacy of non-cancer generic drugs in cancer patient populations. We use Snorkel, a Python-based weak supervision modeling library, which allows domain expertise to be infused into heuristic rules. With a robust set of rules, promising classification accuracy can be achieved cheaply on a large set of documents, making this work easily applicable to other domains.

A Natural Language Processing Pipeline for Drug Repurposing in Cancer

Natural language processing (NLP) is currently being applied at scale to sift through millions of published biomedical studies and synthesize data from the portion deemed relevant. In order to successfully extract information from these studies, one must query a database with a combination of keywords related to the scope of the research. As a result, irrelevant studies that happen to match the keyword search but do not actually pertain to the initial intent must be filtered out of the document corpus. This issue motivates the need for a binary filtering model that can determine document relevance based on certain criteria.

The work presented in this paper is part of a collaboration between cancer biology domain experts and data scientists to construct an NLP pipeline for the task of identifying the most promising FDA-approved non-cancer generic drugs to repurpose for the treatment of cancer [9]. While this ambitious endeavor requires several steps in order to extract drug-cancer evidence from scientific documents and ultimately arrive at a small set of drugs for further study, this paper focuses on the corpus filtering task.

The premise of the approach presented in this paper is to build a model for understanding document “relevance” by de-noising many signals from a set of PubMed titles and abstracts that are automatically labeled by rules developed by domain experts. While a state-of-the-art BERT-based model [1] would presumably achieve higher accuracy on a binary classification task like the one under consideration, it also requires a large corpus of manually annotated documents, which is costly and time-consuming to produce. Such hand-labeled training sets can take months or years to develop for large benchmark sets and require annotators with domain expertise, since the documents under consideration are full of domain-specific jargon. We therefore aim to circumvent this bottleneck by leveraging the knowledge of domain experts to construct a rule-based model that can programmatically label hundreds of thousands of documents with promising accuracy.
This type of rule crafting takes considerably less time and is less tedious than annotating thousands of documents.

Weak Supervision and Snorkel

In practice today, most machine learning systems use some form of weak or distant supervision: noisier, lower-quality, but larger-scale training sets constructed via strategies such as using annotators, programmatic scripts, or high-level input from domain experts [6]. The intent is to harness human supervision more cheaply and efficiently. In this work, we encode domain expertise into heuristic rules while taking advantage of existing resources (i.e., knowledge bases and pretrained models). This method is advantageous for research applications in which a few dozen noisy rules or high-level constraints can perform a task with comparable accuracy, and at a much lower cost, than a large set of labels from domain experts [6].

To apply weak supervision to the filtering task within the NLP-based drug repurposing pipeline, we use a software package called Snorkel [8]. Snorkel is an open-source framework grounded in data programming, a field in which labels are derived from noisy label sources using generative models; it programmatically builds training data for supervised machine learning. Snorkel effectively de-noises signals from a given corpus, without fitting to any labeled data, by implementing the following three key steps:

1. Construct heuristic rules called labeling functions (LFs). These rules are declared by humans, usually domain experts, and represent the only manual step in the Snorkel approach. Apply each of these m rules to all n documents to generate an m × n label matrix.
2. Snorkel pools the noisy signals in the label matrix into a generative model using a factor graph approach, which learns from the agreements and disagreements of the labeling functions without access to any ground-truth data [7]. The outputs of this generative model are predictions for the binary classification of each document.
3. The predictions from the previous step can be used as probabilistic training labels for a noise-aware discriminative model, which is intended to generalize beyond the information expressed in the labeling functions.

To make it easier to define labeling rules, Snorkel adds a special label, ABSTAIN, to the set of labels of the classification task. Whenever a rule cannot decide on one of the task labels, it emits ABSTAIN. For our task, it is much easier to enumerate inclusion rules (i.e., labeling functions for documents that are considered relevant) than exclusion rules. For this reason, we experiment with marking all ABSTAIN labels as non-relevant.

Biomedical Research Corpus

PubMed, provided by the National Center for Biotechnology Information (NCBI), comprises over 40 million biomedical studies from MEDLINE, life science journals, and online books. The large set of unlabeled research studies to be programmatically filtered is sourced from PubMed using a Cochrane highly sensitive search (CHSS) strategy [2] to narrow the scope of our evidence discovery pipeline. Note that this query, even with certain keyword terms listed and publication types specified, does not yield only relevant articles, thus motivating the filtering task. In our experience, only about 30% of the retrieved articles end up being relevant for our purposes. The labeled set of documents for testing our procedure was manually generated by our domain experts.
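As a concrete illustration of how such a corpus can be pulled from PubMed, the sketch below uses Biopython’s Entrez wrapper around NCBI’s E-utilities. The query string, email address, and batch size are placeholders for illustration only; they are not the CHSS query of [2] or the settings used in our pipeline.

    from Bio import Entrez  # Biopython wrapper around NCBI's E-utilities

    # NCBI asks callers to identify themselves; this address is a placeholder.
    Entrez.email = "researcher@example.org"

    # Toy query combining a drug term, cancer terms, and a publication-type
    # filter; the real pipeline uses the Cochrane highly sensitive search [2].
    query = ('metformin[Title/Abstract] AND '
             '(cancer[Title/Abstract] OR neoplasms[MeSH Terms]) AND '
             'clinical trial[Publication Type]')

    # Step 1: retrieve the PubMed IDs matching the query.
    search_handle = Entrez.esearch(db="pubmed", term=query, retmax=200)
    search_results = Entrez.read(search_handle)
    search_handle.close()
    pmids = search_results["IdList"]

    # Step 2: fetch titles and abstracts for the matched IDs.
    fetch_handle = Entrez.efetch(db="pubmed", id=",".join(pmids),
                                 rettype="abstract", retmode="xml")
    records = Entrez.read(fetch_handle)
    fetch_handle.close()

    documents = []
    for article in records["PubmedArticle"]:
        citation = article["MedlineCitation"]["Article"]
        title = str(citation["ArticleTitle"])
        abstract_parts = citation.get("Abstract", {}).get("AbstractText", [])
        abstract = " ".join(str(part) for part in abstract_parts)
        documents.append({"title": title, "abstract": abstract})

Each (title, abstract) pair retrieved in this way becomes one unlabeled document for the filtering task.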
In this work, we focus on clinical studies and consider only publication abstracts. In our experience so far, publication abstracts are sufficiently detailed to decide whether an article is of interest or not. The dataset splits used in our work are listed below, with a brief explanation of how each split was used.

• Unlabeled set [39,843 documents]: The largest split, with no ground-truth labels.
• Test set [1,413 documents]: A small, hand-labeled set for the final evaluation of our classifier; this set is not available for inspection, only for evaluation, so that our rules are not biased toward it.
• Development set [300 documents]: A small set of labeled documents used for inspection during rule creation and for error analysis after the model has been applied.

After initially querying PubMed, the datasets were split, duplicates were removed, and metadata was collected. The strongest source of signal was the title, which is expected since it is the field containing the most essential elements of the work described, including the drug, the cancer type, and sometimes the type of study. An additional helpful feature was the set of cancer concepts mentioned in the abstract, extracted via the Unified Medical Language System (UMLS) linker based on scispaCy [5].

Encoding Biomedical Expertise

We devised a workflow for deeming articles either relevant to the NLP-based drug repurposing pipeline (INCLUDE) or not relevant (EXCLUDE). A document is relevant if a non-cancer generic drug was tested for the treatment of cancer and a phenotype-level outcome was reported. Some of the domain-level expertise encapsulated in this step includes terms that are frequently associated with cancer, deceptive terms that seem to be related to cancer but actually are not (e.g., tumor necrosis factor), and relevant biomedical processes. The sequential workflow was manually converted into parallel, independent labeling functions compatible with Snorkel’s Label Model package, and these rules were treated as the baseline for our Snorkel model, as sketched below.
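To make the translation from expert rules into Snorkel concrete, the following simplified sketch shows what a pair of labeling functions and the generative label model could look like. The keyword sets, function names, and the assumption of a pandas data frame with title and abstract columns are illustrative choices, not the actual rules used in our pipeline.

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    # Label values; ABSTAIN lets a rule withhold judgment on a document.
    ABSTAIN, EXCLUDE, INCLUDE = -1, 0, 1

    # Illustrative term lists; the real rules encode much richer expert knowledge.
    GENERIC_DRUGS = {"metformin", "aspirin", "simvastatin"}
    CANCER_TERMS = {"cancer", "carcinoma", "tumor", "neoplasm", "melanoma"}
    DECEPTIVE_TERMS = {"tumor necrosis factor", "tumor suppressor"}

    @labeling_function()
    def lf_drug_and_cancer_in_title(x):
        # INCLUDE when the title mentions both a generic drug and a cancer term.
        title = x.title.lower()
        has_drug = any(drug in title for drug in GENERIC_DRUGS)
        has_cancer = any(term in title for term in CANCER_TERMS)
        return INCLUDE if has_drug and has_cancer else ABSTAIN

    @labeling_function()
    def lf_deceptive_cancer_terms(x):
        # EXCLUDE when the only cancer-like mentions are deceptive terms
        # such as "tumor necrosis factor".
        text = (x.title + " " + x.abstract).lower()
        stripped = text
        for term in DECEPTIVE_TERMS:
            stripped = stripped.replace(term, "")
        has_deceptive = any(term in text for term in DECEPTIVE_TERMS)
        has_real_cancer = any(term in stripped for term in CANCER_TERMS)
        return EXCLUDE if has_deceptive and not has_real_cancer else ABSTAIN

    lfs = [lf_drug_and_cancer_in_title, lf_deceptive_cancer_terms]

    # Toy corpus standing in for the unlabeled PubMed split.
    df_unlabeled = pd.DataFrame({
        "title": ["Metformin in advanced melanoma: a phase II trial",
                  "Tumor necrosis factor signalling in sepsis"],
        "abstract": ["...", "..."],
    })

    # Apply the labeling functions to every document to build the label matrix
    # (one row per document, one column per labeling function).
    applier = PandasLFApplier(lfs=lfs)
    L_unlabeled = applier.apply(df=df_unlabeled)

    # Fit the generative label model on the agreements and disagreements of the
    # labeling functions, then predict a relevance label for each document.
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train=L_unlabeled, n_epochs=500, seed=123)
    preds = label_model.predict(L=L_unlabeled, tie_break_policy="abstain")

    # As described above, documents the model abstains on are marked non-relevant.
    preds[preds == ABSTAIN] = EXCLUDE

Mapping ABSTAIN predictions to EXCLUDE in the last step reflects the observation above that inclusion rules are much easier to enumerate than exclusion rules.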