Nonstationary Nonparametric Online Learning: Balancing Dynamic Regret and Model Parsimony

An open challenge in supervised learning is \emph{concept drift}: the relationship between data and labels changes over time, so that a point initially assigned one label may later belong to another. Beyond linear autoregressive models, transfer and meta-learning address drift, but they require data representative of the disparate domains at the outset of training. To relax this requirement, we propose a memory-efficient \emph{online} universal function approximator based on compressed kernel methods. Our approach hinges upon viewing non-stationary learning as online convex optimization with dynamic comparators, for which performance is quantified by dynamic regret. Prior works control dynamic regret growth only for linear models. In contrast, we hypothesize that actions belong to a reproducing kernel Hilbert space (RKHS). We propose a functional variant of online gradient descent (OGD) operating in tandem with greedy subspace projections. Projections are necessary because, without compression, the complexity of RKHS elements grows proportionally with time. For this scheme, we establish sublinear dynamic regret growth in terms of both loss variation and functional path length, while the memory of the function sequence remains moderate. Experiments demonstrate the usefulness of the proposed technique for online nonlinear regression and classification problems with non-stationary data.
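To make the functional OGD-plus-projection idea concrete, below is a minimal sketch in Python. It is not the paper's algorithm: the class name, hyperparameters, and the weight-magnitude pruning rule (used here as a crude stand-in for the greedy subspace projections, e.g., kernel orthogonal matching pursuit) are illustrative assumptions. The function $f_t(x)=\sum_i w_i k(d_i, x)$ is stored as a dictionary of kernel centers and weights; each online step appends one center and the pruning step keeps the model order moderate.

\begin{verbatim}
import numpy as np

def gaussian_kernel(X, Y, bw=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * bw**2))

class FunctionalOGD:
    """Sketch: functional online gradient descent with greedy pruning.

    f_t(x) = sum_i w_i k(d_i, x) is stored as a dictionary D of kernel
    centers and a weight vector w.  After each gradient step, centers whose
    weight magnitude falls below `eps` are dropped -- a crude stand-in for
    the greedy subspace projections described in the abstract.
    """

    def __init__(self, dim, eta=0.1, lam=1e-3, eps=1e-3, bw=1.0):
        self.eta, self.lam, self.eps, self.bw = eta, lam, eps, bw
        self.D = np.empty((0, dim))   # dictionary of kernel centers
        self.w = np.empty(0)          # corresponding weights

    def predict(self, x):
        if self.D.shape[0] == 0:
            return 0.0
        return float(gaussian_kernel(x[None, :], self.D, self.bw) @ self.w)

    def step(self, x, y):
        # Squared-loss gradient at the current sample: l'(f(x), y) = f(x) - y
        grad = self.predict(x) - y
        # Functional OGD update: shrink old weights (regularization),
        # then append the new sample as a kernel center.
        self.w = (1.0 - self.eta * self.lam) * self.w
        self.D = np.vstack([self.D, x[None, :]])
        self.w = np.append(self.w, -self.eta * grad)
        self._prune()

    def _prune(self):
        # Greedy pruning: discard centers contributing less than eps
        # (k(x, x) = 1 for the Gaussian kernel, so |w_i| bounds the
        # RKHS norm of the dropped term).
        keep = np.abs(self.w) > self.eps
        self.D, self.w = self.D[keep], self.w[keep]

# Usage: track a slowly drifting sinusoid online.
rng = np.random.default_rng(0)
model = FunctionalOGD(dim=1, eta=0.3, eps=1e-3, bw=0.5)
for t in range(2000):
    x = rng.uniform(-1, 1, size=1)
    y = np.sin(3 * x[0] + 0.002 * t) + 0.05 * rng.standard_normal()
    model.step(x, y)
print("dictionary size:", model.D.shape[0])
\end{verbatim}

The design point the sketch illustrates: without the pruning step the dictionary would grow by one element per time step, which is exactly the complexity-proportional-to-time issue the projections are meant to surmount.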