One-Pass Logistic Regression for Label-Drift and Large-Scale Classification on Distributed Systems

Logistic regression (LR) is a workhorse for classification in industry, but it requires a predefined set of classes. The model therefore fails when the class labels are not known in advance, a problem we term label-drift classification. Label-drift classification arises naturally in many applications, especially in streaming settings, where incoming data may contain samples from new classes that have not been seen before. Additionally, in the era of big data, traditional LR methods may fail because of their prohibitive running time. In this paper, we introduce a novel variant of LR, namely one-pass logistic regression (OLR), to offer a principled treatment of label-drift and large-scale classification. To handle large-scale classification for big data, we further extend OLR to a distributed setting for parallelization, termed sparkling OLR (Spark-OLR). We demonstrate the scalability of our proposed methods on large-scale datasets with more than one hundred million data points. The experimental results show that the predictive performance of our methods is comparable to or better than that of state-of-the-art baselines, while the execution time is about an order of magnitude shorter. In addition, OLR and Spark-OLR are invariant to data shuffling and have no hyperparameters to tune, which significantly benefits data practitioners and removes the burden of cross-validating on big data to select optimal hyperparameters.
