A family of large margin linear classifiers and its application in dynamic environments

Real-time prediction problems pose a challenge to machine learning algorithms because learning must be fast, the set of classes may change, and the relevance of individual features to each class may shift over time. To learn robust classifiers in such nonstationary environments, it is essential not to assign too much weight to any single feature. We address this problem by combining regularization mechanisms with online large-margin learning algorithms. We prove bounds on their error and show that removing features with small weights has little influence on prediction accuracy, which suggests that these methods perform implicit feature selection. We also show that such regularized learning algorithms automatically decrease the influence of older training instances and focus on more recent ones, making them especially attractive in dynamic environments. We evaluate our algorithms on real data sets and in an online activity recognition system. The results show that these regularized large-margin methods adapt more rapidly to changing distributions and achieve lower overall error rates than state-of-the-art methods. Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 328-345, 2009
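
The abstract does not spell out the update rule, so the following is only a minimal sketch of one plausible instantiation of the idea it describes: a passive-aggressive-style online large-margin update followed by per-round L1 soft thresholding, so that small weights are driven to zero. The function name pa_l1_update, the parameters C and lam, and the truncation schedule are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def pa_l1_update(w, x, y, C=1.0, lam=0.01):
    """One online round: large-margin update, then soft-threshold the weights.

    Hedged sketch only; the paper's actual regularized update may differ.
    """
    margin = y * np.dot(w, x)
    loss = max(0.0, 1.0 - margin)                     # hinge loss on this instance
    if loss > 0.0:
        tau = min(C, loss / (np.dot(x, x) + 1e-12))   # PA-I-style step size
        w = w + tau * y * x
    # L1-style shrinkage: weights below lam become exactly zero, which is
    # what yields the feature-selection behavior discussed in the abstract.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Toy usage: only the first two features are relevant to the label.
rng = np.random.default_rng(0)
w = np.zeros(5)
for _ in range(200):
    x = rng.normal(size=5)
    y = 1.0 if x[0] + 0.1 * x[1] > 0 else -1.0
    w = pa_l1_update(w, x, y)
print(np.round(w, 3))  # weights on irrelevant features stay near or at zero
```

Because the shrinkage is applied at every round, weight mass accumulated from old instances decays unless it is refreshed by recent data, which is one way to read the abstract's claim that older training instances lose influence over time.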
