Batch-mode active learning for technology-assisted review

In recent years, technology-assisted review (TAR) has become an increasingly important component of the document review process in litigation discovery, fueled largely by dramatic growth in the data volumes associated with many matters and investigations. Potential review populations frequently exceed several hundred thousand documents, and document counts in the millions are not uncommon. Budgetary and time constraints often make a traditional linear review of these populations impractical, if not impossible, which has made "predictive coding" the most widely discussed TAR approach in recent years. A key challenge in any predictive coding approach is striking the appropriate balance in training the system: minimizing the time that subject matter experts spend training it, while ensuring that enough training is performed to achieve acceptable classification performance over the entire review population. Recent research demonstrates that Support Vector Machines (SVMs) perform very well at building a compact yet effective training dataset iteratively through batch-mode active learning. However, this research remains limited, and these efforts have not produced a principled approach for determining when the active learning process has stabilized. In this paper, we propose and compare several batch-mode active learning methods that are integrated with the SVM learning algorithm, and we also propose methods for determining the stabilization of the active learning process. Experimental results on a set of large-scale, real-life legal document collections validate the superiority of our methods over existing methods for this task.
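To make the general idea concrete (this is an illustrative sketch, not the paper's algorithm), the following pure-Python example shows the two ingredients the abstract describes: batch-mode uncertainty sampling, where each batch consists of the unlabeled documents closest to the decision boundary (smallest |w·x|), and a stabilization-based stopping rule, where querying halts once successive models' predictions agree almost perfectly (measured here with Cohen's kappa). The perceptron trainer, the 2-D synthetic documents, the oracle labeler, the batch size, and the 0.99 threshold are all assumptions made to keep the sketch self-contained.

```python
import random

random.seed(0)

def margin(w, x):
    # signed score of document x under linear model w (proxy for an SVM margin)
    return sum(wi * xi for wi, xi in zip(w, x))

def train(labeled):
    # toy perceptron trainer over the labeled pool; a stand-in for an
    # actual SVM solver, used here only to keep the sketch dependency-free
    w = [0.0] * len(labeled[0][0])
    for _ in range(20):
        for x, y in labeled:
            if y * margin(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def select_batch(w, unlabeled, k):
    # batch-mode uncertainty sampling: the k documents with the smallest
    # absolute margin |w . x| form the next batch sent for human labeling
    return sorted(unlabeled, key=lambda x: abs(margin(w, x)))[:k]

def kappa(a, b):
    # Cohen's kappa agreement between two +1/-1 prediction vectors
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(x == 1 for x in a) / n, sum(x == 1 for x in b) / n
    pe = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

# synthetic review population labeled by a hidden linear rule (the "expert")
def oracle(x):
    return 1 if x[0] - x[1] > 0 else -1

pool = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labeled = [(x, oracle(x)) for x in pool[:10]]   # small seed training set
unlabeled = [x for x in pool[10:]]

prev = None
for _ in range(10):
    w = train(labeled)
    preds = [1 if margin(w, x) > 0 else -1 for x in pool]
    # stopping rule: halt once successive models' predictions stabilize
    if prev is not None and kappa(prev, preds) > 0.99:
        break
    prev = preds
    batch = select_batch(w, unlabeled, 10)
    labeled += [(x, oracle(x)) for x in batch]  # expert labels the batch
    unlabeled = [x for x in unlabeled if x not in batch]
```

In a real TAR system the toy trainer would be replaced by a proper SVM, documents would be high-dimensional feature vectors rather than 2-D points, and the batch size and stabilization threshold would be tuned; the loop structure, however, mirrors the train / select / label / check-stability cycle the abstract describes.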
