Predicting Publication Inclusion for Diagnostic Accuracy Test Reviews Using Random Forests and Topic Modelling

Finding all relevant publications for a systematic review can be a time-consuming task, especially in the field of diagnostic test accuracy. The CLEF eHealth lab ‘technologically assisted reviews in empirical medicine’ was therefore established to create a basis for comparing various methods. In this paper we describe a method submitted to the lab. The method consists of a topic model used to extract features and a random forest used to classify the relevant papers. Classifier performance shows an average decrease in workload (i.e., documents to read) of 33.3% when aiming for 95% recall and 24.9% when aiming for 100% recall. However, workload reduction varies widely between the diagnostic test accuracy reviews, from 79.3% down to 0.9%.
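To make the workload figures concrete, the following is a minimal sketch of one plausible way to compute workload reduction at a target recall from a ranked list of documents. The abstract does not spell out the exact definition used in the paper, so the function name `workload_reduction`, its arguments, and the toy scores and labels are all illustrative assumptions rather than the authors' implementation.

```python
def workload_reduction(scores, labels, target_recall):
    """Fraction of documents that need not be read to reach target_recall.

    Assumption (not taken from the paper): documents are read in order of
    decreasing classifier score, and reading stops as soon as the share of
    relevant documents found reaches target_recall.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("at least one relevant document is required")

    # Rank documents from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    found = 0
    for n_read, (_, label) in enumerate(ranked, start=1):
        found += label
        if found / total_relevant >= target_recall:
            return 1 - n_read / len(ranked)
    return 0.0  # target recall only reached after reading everything


# Hypothetical toy data: 10 documents, 4 of them relevant (label 1).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

reduction_at_75 = workload_reduction(toy_scores, toy_labels, 0.75)
reduction_at_100 = workload_reduction(toy_scores, toy_labels, 1.0)
```

In this toy ranking the third relevant document sits at position 4 and the last one at position 7, so a higher recall target leaves fewer documents unread. The same pattern appears in the reported averages, where the reduction at 100% recall is smaller than at 95% recall.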
