Active learning for online training in imbalanced data streams under cold start

Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models perform well once trained, but initially there are no labels, and labels are costly or slow to obtain. The problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is either i) expensive or ii) subject to long delays, if it relies on victims filing complaints. The latter may not be viable if a model has to be in place immediately, so one option is to ask analysts to label events while keeping the number of annotations low to control costs. We propose an Active Learning (AL) annotation system for datasets with class imbalance spanning orders of magnitude, in a cold-start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies in which ODAL serves as the warm-up. We then perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies. Its observed gains over random sampling can reach 80%, and, using only 1/10 to 1/50 of the labels, it can be competitive with policies that have an unlimited annotation budget or additional historical data.
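To make the cold-start setting concrete, the following is a minimal sketch of a budgeted, label-free warm-up step. It is not ODAL, whose details the abstract does not give. It swaps in a simple robust z-score (median and MAD) as the outlier score and sends the most outlying events to annotators. The names `outlier_scores`, `select_for_annotation`, and `make_toy_batch`, the budget of 50, and the synthetic data are all assumptions made for illustration.

```python
import random
import statistics


def outlier_scores(events, reference):
    """Score each event by its summed robust z-score (median/MAD) against `reference`.

    This is a stand-in for any unsupervised outlier detector, not the paper's method.
    """
    n_features = len(reference[0])
    medians, mads = [], []
    for j in range(n_features):
        column = [row[j] for row in reference]
        med = statistics.median(column)
        mad = statistics.median([abs(v - med) for v in column])
        medians.append(med)
        mads.append(mad if mad > 0 else 1e-9)  # guard against constant features
    return [
        sum(abs(x[j] - medians[j]) / mads[j] for j in range(n_features))
        for x in events
    ]


def select_for_annotation(events, budget):
    """Return indices of the `budget` most outlying events (hypothetical warm-up policy)."""
    scores = outlier_scores(events, events)
    ranked = sorted(range(len(events)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


def make_toy_batch(n_events=1000, n_positive=10, seed=0):
    """Synthetic imbalanced batch; positives sit far from the bulk (an assumption)."""
    rng = random.Random(seed)
    n_negative = n_events - n_positive
    events = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_negative)]
    labels = [0] * n_negative
    events += [[rng.gauss(8, 1), rng.gauss(8, 1)] for _ in range(n_positive)]
    labels += [1] * n_positive
    order = list(range(n_events))
    rng.shuffle(order)
    return [events[i] for i in order], [labels[i] for i in order]


if __name__ == "__main__":
    events, labels = make_toy_batch()
    budget = 50  # number of events analysts can label in this batch
    chosen = select_for_annotation(events, budget)
    baseline = random.Random(1).sample(range(len(events)), budget)
    print("positives found, outlier-based:", sum(labels[i] for i in chosen))
    print("positives found, random:", sum(labels[i] for i in baseline))
```

The toy data is built so that the rare class is also anomalous, which is what makes an outlier-based policy find positives faster than random sampling without any labels. In real fraud data that assumption may hold only partially, so the sketch shows the mechanism rather than an expected gain.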
