Automatic Thresholding by Sampling Documents and Estimating Recall

In this paper, we describe the participation of the Information and Language Processing Systems (ILPS) group in CLEF eHealth 2019 Task 2.2: Technology Assisted Reviews in Empirical Medicine. The task aims to produce an efficient ordering of the documents and to identify a subset of the documents that contains as many of the relevant abstracts as possible for the least effort. Participants are provided with systematic review topics, each including a review title, a Boolean query constructed by Cochrane experts, and a set of PubMed Document Identifiers (PIDs) returned by running the Boolean query in MEDLINE. We approach the problem within the Continuous Active Learning framework by jointly training a ranking model to rank documents and conducting “greedy” sampling to estimate the true number of relevant documents in the collection. We submitted four runs.