Learning to identify relevant studies for systematic reviews using random forest and external information

We tackle the problem of automatically filtering studies during the preparation of Systematic Reviews (SRs), a task that normally entails manually inspecting thousands of studies to identify the few that should be included. We model the problem as an imbalanced classification task in which misclassifying the minority class (included studies) costs more than misclassifying the majority class (excluded studies). This work introduces a novel method for representing systematic reviews that is based not only on lexical features but also on word clustering and citation features. This representation is shown to outperform previously used features, regardless of the classifier. We use a random forest classifier with the novel features to predict included studies with high recall. The parameters of the random forest are configured automatically using heuristic methods, which makes the approach usable in real scenarios. Experiments on a dataset of 15 systematic reviews prepared by health care professionals show that our approach can achieve high recall while saving the SR author time.
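The cost asymmetry described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the cost values, and the example scores are all hypothetical, and the scores merely stand in for the inclusion probabilities that a classifier such as a random forest might produce. The sketch shows how making a missed relevant study more expensive lowers the decision threshold, and how recall and the share of studies the reviewer can skip are then measured.

```python
def cost_sensitive_threshold(cost_fn, cost_fp):
    """Probability at or above which flagging a study has the lower expected cost.

    Flag when p * cost_fn >= (1 - p) * cost_fp, i.e. p >= cost_fp / (cost_fp + cost_fn).
    cost_fn: assumed cost of missing a relevant study (false negative).
    cost_fp: assumed cost of reading an irrelevant study (false positive).
    """
    return cost_fp / (cost_fp + cost_fn)


def screen(scores, threshold):
    """Flag every study whose predicted inclusion probability reaches the threshold."""
    return [s >= threshold for s in scores]


def recall_and_saved(labels, flagged):
    """Return (recall on included studies, fraction of studies not sent to the reviewer)."""
    true_pos = sum(1 for y, f in zip(labels, flagged) if y and f)
    positives = sum(labels)
    recall = true_pos / positives if positives else 1.0
    saved = flagged.count(False) / len(flagged)
    return recall, saved


# Hypothetical classifier scores and gold labels (1 = included in the review).
scores = [0.90, 0.40, 0.15, 0.05, 0.02, 0.30, 0.01, 0.08]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

# Assumption: missing a relevant study is nine times as costly as reading an irrelevant one.
asymmetric = screen(scores, cost_sensitive_threshold(cost_fn=9, cost_fp=1))
# Baseline: equal costs give the usual 0.5 threshold.
symmetric = screen(scores, cost_sensitive_threshold(cost_fn=1, cost_fp=1))

print("asymmetric costs:", recall_and_saved(labels, asymmetric))
print("equal costs:     ", recall_and_saved(labels, symmetric))
```

On this toy data the asymmetric costs recover every included study at the price of reading more candidates, while equal costs skip more studies but miss most of the relevant ones. That trade-off is the reason the task is framed as cost-sensitive rather than as plain accuracy maximization.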
