Active learning for imbalanced data under cold start

Modern systems that rely on Machine Learning (ML) for predictive modelling may suffer from the cold-start problem: supervised models work well, but initially there are no labels, and labels are costly or slow to obtain. This problem is even worse in imbalanced data scenarios, where labels of the positive class take longer to accumulate. We propose an Active Learning (AL) system for datasets with orders of magnitude of class imbalance in a cold-start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies in which ODAL is used as warm-up. We then perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies without ODAL warm-up. Its observed gains over random sampling can reach 80%, and it can be competitive with policies that have an unlimited annotation budget or additional historical data, while using just 2% to 10% of the labels.
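
To make the idea of a staged labeling policy concrete, the following is a minimal sketch and not the ODAL implementation. The abstract does not say what each stage does, so everything here is an assumption made for illustration: the function names `outlier_scores` and `select_batch`, the `min_positives` threshold, random sampling as the first stage, nearest-labeled-point distance as a stand-in outlier score for the warm-up stage, and a caller-supplied `uncertainty` function for the last stage.

```python
import math
import random


def outlier_scores(unlabeled, labeled):
    """Score each unlabeled point by its distance to the nearest labeled point.

    Stand-in outlier score (an assumption): points far from everything
    labeled so far are treated as the most informative to annotate.
    """
    return [min(math.dist(x, seen) for seen in labeled) for x in unlabeled]


def select_batch(unlabeled, labeled_x, labeled_y, batch_size, rng,
                 min_positives=5, uncertainty=None):
    """Return indices of the unlabeled points to send for annotation.

    The three stages and the switching rule are hypothetical:
      1. no labels yet           -> random sampling
      2. too few positive labels -> outlier-based warm-up
      3. enough positive labels  -> uncertainty from a supervised model
    """
    n = len(unlabeled)
    k = min(batch_size, n)
    if not labeled_x:
        # Stage 1: nothing is labeled, so there is nothing to compare against.
        return rng.sample(range(n), k)
    if uncertainty is None or sum(labeled_y) < min_positives:
        # Stage 2: rank by distance from the labeled pool.
        scores = outlier_scores(unlabeled, labeled_x)
    else:
        # Stage 3: rank by the supervised model's uncertainty.
        scores = [uncertainty(x) for x in unlabeled]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


rng = random.Random(0)
pool = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.2)]
# One negative label so far: the warm-up stage picks the point farthest from it.
print(select_batch(pool, [(0.0, 0.1)], [0], 1, rng))  # prints [2]
```

In this sketch the switch between stages depends only on how many labels, and how many positive labels, have been collected. Under heavy class imbalance the second condition can take a long time to satisfy, which is why a warm-up stage that needs no positive labels is useful.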
