Trialstreamer: A living, automatically updated database of clinical trial reports

Objective: Randomized controlled trials (RCTs) are the gold standard method for evaluating whether a treatment works in healthcare, but they can be difficult to find and make use of. We describe the development and evaluation of a system to automatically find and categorize all new RCT reports.

Materials and Methods: Trialstreamer continuously monitors PubMed and the WHO International Clinical Trials Registry Platform (ICTRP), looking for new RCTs in humans using a validated classifier. We combine machine learning and rule-based methods to extract information from the RCT abstracts, including free-text descriptions of trial populations, interventions and outcomes (the 'PICO'), and map these snippets to normalised MeSH vocabulary terms. We additionally identify sample sizes, predict the risk of bias, and extract text conveying key findings. We store all extracted data in a database, which we make freely available for download and via a search portal that allows users to enter structured clinical queries. Search results are ranked automatically to prioritize larger and higher-quality studies.

Results: As of May 2020, we have indexed 669,895 publications of RCTs, of which 18,485 were published in the first four months of 2020 (144/day). We additionally include 303,319 trial registrations from ICTRP. The median trial sample size in the RCTs was 66.

Conclusions: We present an automated system for finding and categorising RCTs. This yields a novel resource: a database of structured information automatically extracted for all published RCTs in humans. We make daily updates of this database available on our website (trialstreamer.robotreviewer.net).
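The abstract says that search results are ranked to prioritize larger and higher-quality studies, but it does not give the ranking rule. As an illustration only, the sketch below assumes a simple two-level ordering: trials predicted to be at low risk of bias come first, and within each group larger sample sizes come first. The `TrialRecord` fields, the `rank_trials` function, and the example records are hypothetical and are not taken from Trialstreamer's code or data.

```python
from dataclasses import dataclass


@dataclass
class TrialRecord:
    # Hypothetical record; field names are assumptions, not Trialstreamer's schema.
    identifier: str
    sample_size: int
    low_risk_of_bias: bool  # assumed binary output of a risk-of-bias predictor


def rank_trials(records):
    """Return records ordered with low-risk-of-bias trials first,
    and larger sample sizes first within each risk group."""
    return sorted(
        records,
        key=lambda r: (not r.low_risk_of_bias, -r.sample_size),
    )


# Made-up example records, for illustration only.
records = [
    TrialRecord("A", 66, False),
    TrialRecord("B", 500, True),
    TrialRecord("C", 40, True),
    TrialRecord("D", 1200, False),
]

ranked = rank_trials(records)
ranked_ids = [r.identifier for r in ranked]
print(ranked_ids)
```

A real system could instead combine the two signals into a single weighted score; the lexicographic ordering here is just the simplest choice that respects both criteria.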
