Paper Information - BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis

This work presents BioSift, a new document classification dataset designed to expedite the initial selection and labeling of studies for drug repurposing. The dataset consists of 10,000 human-annotated abstracts from scientific articles in PubMed. Each abstract is labeled with up to eight attributes needed to perform meta-analysis using the popular patient-intervention-comparator-outcome (PICO) method: has human subjects, is clinical trial/cohort, has population size, has target disease, has study drug, has comparator group, has a quantitative outcome, and an "aggregate" label. Each abstract was annotated by three different annotators (biomedical students), and randomly sampled abstracts were reviewed by senior annotators to ensure quality. Data statistics such as reviewer agreement, label co-occurrence, and confidence are reported. Benchmark results show that neither PubMed advanced filters nor state-of-the-art document classification schemes (e.g., active learning, weak supervision, full supervision) can efficiently replace human annotation. In short, BioSift is a pivotal but challenging document classification task for expediting drug repurposing. The full annotated dataset is publicly available and supports the development of document classification algorithms that enhance drug repurposing.
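
To make the labeling scheme concrete, the following is a minimal sketch of how three annotators' votes on the eight attributes might be combined and summarized. The abstract does not say how the three annotations are reconciled, so the majority-vote rule, the snake_case label names, the record layout, and the toy data are all assumptions made for illustration, not the dataset's actual format or the authors' procedure.

```python
from collections import Counter

# Label names follow the abstract; the snake_case spelling is an assumption.
LABELS = [
    "has_human_subjects",
    "is_clinical_trial_or_cohort",
    "has_population_size",
    "has_target_disease",
    "has_study_drug",
    "has_comparator_group",
    "has_quantitative_outcome",
    "aggregate",
]


def majority_vote(votes):
    """Return True when more than half of the annotators marked the label.

    Majority voting is an assumed reconciliation rule, not one stated in the abstract.
    """
    return sum(votes) * 2 > len(votes)


def full_agreement_rate(annotations, label):
    """Fraction of abstracts on which every annotator gave the same vote for `label`."""
    if not annotations:
        return 0.0
    unanimous = sum(1 for record in annotations if len(set(record[label])) == 1)
    return unanimous / len(annotations)


def label_cooccurrence(resolved):
    """Count how often each pair of labels is positive on the same abstract."""
    counts = Counter()
    for row in resolved:
        positives = [label for label in LABELS if row[label]]
        for i, first in enumerate(positives):
            for second in positives[i + 1:]:
                counts[(first, second)] += 1
    return counts


# Hypothetical toy data: each abstract maps a label to three annotator votes.
toy = [
    {label: [True, True, True] for label in LABELS},
    {label: [True, False, False] for label in LABELS},
]

# Resolve each abstract's votes into one boolean per label.
resolved = [
    {label: majority_vote(votes) for label, votes in record.items()}
    for record in toy
]
agreement = full_agreement_rate(toy, "has_study_drug")
pairs = label_cooccurrence(resolved)
```

Simple unanimity is used here only because it is easy to read; a chance-corrected agreement measure could be substituted without changing the rest of the sketch.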
