When to Stop Reviewing in Technology-Assisted Reviews

Technology-Assisted Reviews (TAR) aim to expedite document review (e.g., of medical articles or legal documents) by iteratively combining machine learning algorithms with human feedback on document relevance. Continuous Active Learning (CAL) algorithms have demonstrated superior performance over other methods in efficiently identifying relevant documents. A key challenge for CAL algorithms is deciding when to stop presenting documents to reviewers. Existing work either lacks transparency, providing an ad hoc stopping point without indicating how many relevant documents remain unfound, or lacks efficiency, incurring an extra cost to estimate the total number of relevant documents in the collection before the actual review. In this article, we address the problem of deciding when to stop a TAR process under the continuous active learning framework by jointly training a ranking model to order documents and conducting a "greedy" sampling to estimate the total number of relevant documents in the collection. We prove the unbiasedness of the proposed estimators under a with-replacement sampling design, and experimental results demonstrate that the proposed approach retrieves relevant documents as effectively as CAL while also providing a transparent, accurate, and effective stopping point.
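
The abstract does not spell out the estimator itself. As a rough illustration of how a with-replacement sampling design can yield an unbiased estimate of the total number of relevant documents, the sketch below implements a Hansen-Hurwitz-style estimator in which documents are drawn with probabilities proportional to their ranking scores. The function name, the use of ranking scores as sampling weights, the simulated judge, and the toy data are assumptions made for illustration and are not the authors' exact method.

```python
import numpy as np

def estimate_total_relevant(scores, judge, n_samples, seed=None):
    """Hansen-Hurwitz-style estimate of the total number of relevant
    documents under with-replacement sampling (illustrative sketch).

    scores    : non-negative ranking scores, one per document; higher
                score -> higher chance of being sampled.
    judge     : callable mapping a document index to a 0/1 relevance
                label (stands in for the human reviewer).
    n_samples : number of with-replacement draws spent on estimation.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)

    # Selection probabilities skewed toward highly ranked documents
    # ("greedy" here only in the sense that top-scored documents are
    # drawn more often; the paper's sampling scheme may differ).
    p = scores / scores.sum()

    draws = rng.choice(len(scores), size=n_samples, replace=True, p=p)

    # Each draw contributes y_i / p_i; averaging the contributions gives
    # an unbiased estimate of sum_i y_i, i.e. the total number of
    # relevant documents in the collection.
    contributions = np.array([judge(i) / p[i] for i in draws])
    return contributions.mean()

# Toy usage (assumed data): 1,000 documents, 50 of them relevant, with
# scores loosely correlated with relevance; the estimate should land
# roughly near 50.
rng = np.random.default_rng(0)
relevance = np.zeros(1000)
relevance[:50] = 1
scores = 0.05 + 0.95 * relevance + 0.10 * rng.random(1000)
print(estimate_total_relevant(scores, lambda i: int(relevance[i]), n_samples=300))
```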
