Application of dynamic logistic regression with unscented Kalman filter in predictive coding

Predictive coding, which adapts text categorization to litigation support, is an evolving process in which responsive documents are identified while labeling decisions change over time. The current state of the art in predictive coding workflows uses Active Learning, in which a new model is periodically rebuilt as additional documents are reviewed, to continuously revise the model and improve the identification of responsive documents. We propose an alternative approach that recursively updates the model with the Unscented Kalman Filter each time a document is labeled. Using synthetic streaming text data with induced concept drift, we show that our approach learns new patterns faster, achieves better accuracy and recall, and reduces labeling cost. Together, these results make it a potentially better way to update the model within Active Learning for predictive coding.
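The abstract does not give the model equations, so the following is only a minimal sketch of what a per-document recursive update could look like, not the authors' implementation. It assumes random-walk dynamics on the logistic regression weights, a Bernoulli variance as the observation noise, and a basic unscented transform with a single spread parameter. The function name `ukf_logistic_update` and the parameters `q`, `kappa`, and `r_floor` are hypothetical names introduced for illustration.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme activations.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def ukf_logistic_update(m, P, x, y, q=1e-2, kappa=1.0, r_floor=1e-3):
    """One recursive weight update from a single labeled document.

    m, P : prior mean (n,) and covariance (n, n) of the weights
    x    : feature vector (n,) of the newly labeled document
    y    : label, 1 = responsive, 0 = non-responsive
    q    : random-walk variance (assumed); larger values adapt faster to drift
    """
    n = m.size
    # Predict: random-walk dynamics keep the mean and inflate the covariance.
    P_pred = P + q * np.eye(n)

    # Sigma points: the mean plus symmetric offsets along the Cholesky columns.
    L = np.linalg.cholesky((n + kappa) * P_pred)
    sigma = np.vstack([m, m + L.T, m - L.T])      # shape (2n + 1, n)
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)

    # Push each sigma point through the logistic observation model.
    z = sigmoid(sigma @ x)
    y_hat = w @ z
    # Bernoulli variance as observation noise (an assumption), floored for stability.
    R = max(y_hat * (1.0 - y_hat), r_floor)
    S = w @ (z - y_hat) ** 2 + R                  # innovation variance
    C = (sigma - m).T @ (w * (z - y_hat))         # weight/output cross-covariance

    # Correct: Kalman gain, posterior mean, posterior covariance.
    K = C / S
    m_new = m + K * (y - y_hat)
    P_new = P_pred - S * np.outer(K, K)
    return m_new, 0.5 * (P_new + P_new.T)         # symmetrize against round-off
```

In this sketch each labeled document triggers one call, for example `m, P = ukf_logistic_update(m, P, x, y)` starting from `m = np.zeros(n)` and `P = np.eye(n)`, so no periodic retraining pass is needed. The choice of `q` trades stability against how quickly the weights can follow concept drift.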
