Automating Document Classification with Distant Supervision to Increase the Efficiency of Systematic Reviews

Objective: Systematic reviews of scholarly documents often provide complete and exhaustive summaries of the literature relevant to a research question. However, well-done systematic reviews are expensive, time-consuming, and labor-intensive. Here, we propose an automatic document classification approach to substantially reduce the effort of reviewing documents. Methods: We first describe a manual document classification procedure that is used to curate a pertinent training dataset and then propose three classifiers: a keyword-guided method, a refined method based on cluster analysis, and a random forest approach that uses a large set of feature tokens. As an example, this approach is used to identify documents studying female sex workers that are assumed to contain content relevant to either HIV or violence. We compare the performance of the three classifiers by cross-validation and conduct a sensitivity analysis on the portion of the data used to train the model. Results: The random forest approach provides the highest area under the curve (AUC) for both the receiver operating characteristic (ROC) and precision/recall (PR) curves. Analyses of precision and recall suggest that, with the random forest classifier, manually reviewing only 20% of the articles would capture 80% of the relevant cases. Finally, we found that a good classifier could be obtained from a relatively small training sample. Conclusions: In sum, the automated document classification procedure presented here could improve both the precision and efficiency of systematic reviews and facilitate live reviews, which are updated regularly.
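
The precision and recall result above can be read as recall at a fixed review budget: rank documents by classifier score, manually review only the top fraction, and count how many relevant documents fall inside it. The following is a minimal sketch in Python, assuming a classifier has already assigned each document a relevance score. The function name, the scores, and the labels are hypothetical illustrations and are not data from the study.

```python
from typing import List, Tuple


def recall_precision_at_fraction(
    scores: List[float], labels: List[int], fraction: float
) -> Tuple[float, float]:
    """Review the top `fraction` of documents ranked by score.

    Returns (recall, precision) within the reviewed subset.
    `labels` uses 1 for relevant and 0 for not relevant.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one document is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Rank documents from highest to lowest classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Number of documents a human would review under this budget.
    n_review = max(1, round(fraction * len(ranked)))
    reviewed = ranked[:n_review]

    hits = sum(label for _, label in reviewed)
    total_relevant = sum(labels)

    recall = hits / total_relevant if total_relevant else 0.0
    precision = hits / n_review
    return recall, precision


# Hypothetical toy data: 20 documents, 4 of them relevant.
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35,
              0.30, 0.28, 0.25, 0.22, 0.20, 0.15, 0.12, 0.10, 0.05, 0.02]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

recall, precision = recall_precision_at_fraction(toy_scores, toy_labels, 0.2)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

On this toy data, reviewing the top 20% (4 of 20 documents) finds 3 of the 4 relevant documents. Sweeping `fraction` from small to large values traces the trade-off between review effort and the share of relevant documents recovered.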
