Data Balancing for Technologically Assisted Reviews: Undersampling or Reweighting

This paper presents approaches for automated support of citation screening in systematic reviews. Continuous active learning serves as our baseline approach, on top of which two data balancing techniques are applied to handle the class imbalance problem. These two techniques, aggressive undersampling and reweighting, are tested and compared on 20 data sets for Diagnostic Test Accuracy (DTA) reviews. Results are evaluated by last rel and suggest that reweighting outperforms undersampling: it not only balances the training data but also emphasizes the “content relevant” examples over the “abstract relevant” ones, and thus helps to retrieve “content relevant” papers earlier.
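
To make the contrast between the two balancing techniques concrete, the following Python sketch shows one plausible form of each. It is an illustration under stated assumptions, not the implementation evaluated in this paper. The label encoding (0 = non-relevant, 1 = abstract relevant, 2 = content relevant), the function names, the `content_boost` multiplier, and the rule of keeping the non-relevant examples with the lowest classifier scores are all assumptions introduced here for illustration.

```python
from typing import List, Sequence


def reweight(labels: Sequence[int], content_boost: float = 1.0) -> List[float]:
    """Return one training weight per example.

    Assumed label encoding: 0 = non-relevant, 1 = abstract relevant,
    2 = content relevant. Relevant examples get weight n_neg / n_pos so
    that both classes carry equal total weight when content_boost is 1.
    content_boost is a hypothetical extra multiplier applied only to
    content-relevant examples.
    """
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y > 0)
    if n_neg == 0 or n_pos == 0:
        # Nothing to balance when only one class is present.
        return [1.0] * len(labels)
    pos_weight = n_neg / n_pos
    weights = []
    for y in labels:
        if y == 0:
            weights.append(1.0)
        elif y == 2:
            weights.append(pos_weight * content_boost)
        else:
            weights.append(pos_weight)
    return weights


def aggressive_undersample(labels: Sequence[int],
                           scores: Sequence[float]) -> List[int]:
    """Return the indices of the examples kept for training.

    All relevant examples are kept. Of the non-relevant ones, only as many
    as there are relevant examples are kept, chosen as those with the
    lowest classifier scores (assumed here to be the ones the current
    model is most confident are non-relevant).
    """
    pos = [i for i, y in enumerate(labels) if y > 0]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = sorted(neg, key=lambda i: scores[i])[:len(pos)]
    return sorted(pos + kept_neg)


# Toy data: four non-relevant, one abstract relevant, one content relevant.
labels = [0, 0, 0, 0, 1, 2]
scores = [-2.0, -0.1, -1.5, -0.3, 0.8, 1.2]
weights = reweight(labels, content_boost=2.0)
kept = aggressive_undersample(labels, scores)
```

In this sketch, undersampling discards training examples outright, whereas reweighting keeps every example and can additionally distinguish “content relevant” from “abstract relevant” examples through their weights.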