BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis

This work presents BioSift, a new document classification dataset intended to expedite the initial selection and labeling of studies for drug repurposing. The dataset consists of 10,000 human-annotated abstracts from scientific articles in PubMed. Each abstract is labeled with up to eight attributes needed to perform meta-analysis with the widely used patient-intervention-comparator-outcome (PICO) method: has human subjects, is clinical trial/cohort, has population size, has target disease, has study drug, has comparator group, has a quantitative outcome, and an "aggregate" label. Each abstract was annotated by three different annotators (biomedical students), and randomly sampled abstracts were reviewed by senior annotators to ensure quality. Dataset statistics such as reviewer agreement, label co-occurrence, and confidence are reported. Benchmark results show that neither PubMed advanced filters nor state-of-the-art document classification schemes (e.g., active learning, weak supervision, full supervision) can efficiently replace human annotation. In short, BioSift poses an important but challenging document classification task for expediting drug repurposing. The full annotated dataset is publicly available and supports the research and development of document classification algorithms that enhance drug repurposing.
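As an illustration of the labeling scheme described above, the following is a minimal sketch of how three annotators' votes over the eight attributes might be combined and summarized. The label identifiers, the majority-vote rule, and the "unanimous rate" agreement measure are assumptions made for this example; the abstract does not state how annotations are aggregated or which agreement statistic is reported.

```python
from collections import Counter
from itertools import combinations

# Hypothetical identifiers for the eight attributes named in the abstract.
LABELS = [
    "has_human_subjects",
    "is_clinical_trial_or_cohort",
    "has_population_size",
    "has_target_disease",
    "has_study_drug",
    "has_comparator_group",
    "has_quantitative_outcome",
    "aggregate",
]


def majority_vote(votes):
    """Combine per-annotator votes (list of dicts: label -> bool).

    Assumption: a label is positive when more than half of the
    annotators marked it positive.
    """
    return {
        label: 2 * sum(v[label] for v in votes) > len(votes)
        for label in LABELS
    }


def unanimous_rate(dataset):
    """Per label, the fraction of abstracts on which all annotators agree.

    `dataset` is a list of abstracts, each a list of per-annotator votes.
    This is one simple agreement measure, chosen for illustration only.
    """
    rates = {}
    for label in LABELS:
        agreed = sum(
            len({v[label] for v in votes}) == 1 for votes in dataset
        )
        rates[label] = agreed / len(dataset)
    return rates


def co_occurrence(final_labels):
    """Count how often each pair of labels is positive on the same abstract."""
    counts = Counter()
    for labels in final_labels:
        positive = sorted(l for l in LABELS if labels[l])
        counts.update(combinations(positive, 2))
    return counts
```

For example, calling `majority_vote` on each abstract's three votes and passing the results to `co_occurrence` yields pair counts from which a label co-occurrence table could be built.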
