Online Learning of Noisy Data

We study online learning of linear and kernel-based predictors when individual examples are corrupted by random noise, and both the examples and the noise type can be chosen adversarially and change over time. We begin with the setting where some auxiliary information on the noise distribution is provided, and we wish to learn predictors with respect to the squared loss. Depending on the auxiliary information, we show how one can learn linear and kernel-based predictors using just one or two noisy copies of each example; a sketch of one natural construction appears below. We then turn to a more general setting where virtually nothing is known about the noise distribution, and one wishes to learn with respect to general losses using linear and kernel-based predictors. We show how this can be achieved using a random, essentially constant number of noisy copies of each example. Allowing multiple copies cannot be avoided: indeed, we show that the setting becomes impossible when only one noisy copy of each instance can be accessed. To obtain our results we introduce several novel techniques, some of which might be of independent interest.
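To make the two-copy idea concrete, the following is a minimal sketch (not necessarily the paper's exact algorithm) of online gradient descent on the squared loss when only noisy views of each instance are observed. It assumes a hypothetical stream that, in each round, yields two independent noisy copies x1, x2 of the true instance x_t, each with mean x_t; independence makes (⟨w, x1⟩ − y) x2 an unbiased estimate of the true gradient (⟨w, x_t⟩ − y) x_t. The function names, step size, and noise model are illustrative assumptions.

```python
import numpy as np

def online_noisy_sgd(stream, dim, eta=0.01):
    """Online gradient descent on the squared loss from two noisy copies.

    Assumes each round yields (x1, x2, y), where x1 and x2 are independent
    noisy views of the unseen instance x_t with E[x1] = E[x2] = x_t.
    Then (<w, x1> - y) * x2 is an unbiased estimate of the clean gradient
    (<w, x_t> - y) * x_t, because the two copies are independent.
    """
    w = np.zeros(dim)
    for x1, x2, y in stream:
        grad_estimate = (np.dot(w, x1) - y) * x2  # unbiased gradient estimate
        w -= eta * grad_estimate                  # standard OGD update
    return w

# Toy usage (illustrative): true instances observed through additive Gaussian noise.
rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0, 0.5])

def noisy_stream(T=1000, sigma=0.5):
    for _ in range(T):
        x = rng.normal(size=3)
        y = float(np.dot(w_star, x))
        yield x + sigma * rng.normal(size=3), x + sigma * rng.normal(size=3), y

w_hat = online_noisy_sgd(noisy_stream(), dim=3)
```

The same unbiased-estimate device extends, under suitable auxiliary information about the noise, to kernel-based predictors; the single-copy variants and the general-loss setting require the additional techniques described in the paper.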
