Fast learning from α-mixing observations

We present a new oracle inequality for generic regularized empirical risk minimization algorithms learning from stationary α-mixing processes. Our main tool for deriving this inequality is a rather involved variant of the so-called peeling method. We then use the oracle inequality to derive learning rates for several learning methods, namely empirical risk minimization (ERM), least squares support vector machines (SVMs) with generic kernels, and SVMs with Gaussian RBF kernels for both least squares and quantile regression. It turns out that for i.i.d. processes our learning rates for ERM and for SVMs with Gaussian kernels match the optimal rates up to an arbitrarily small extra term in the exponent, while in the remaining cases our rates are at least close to the optimal rates.
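
For orientation, the two standard objects behind these statements can be written out explicitly. The following is a brief sketch using the customary definitions of the α-mixing coefficients and of a generic regularized (SVM-type) empirical risk minimizer; the paper's precise notation and assumptions may differ in detail. For a stationary process $(X_i)_{i\ge 1}$, the α-mixing (strong mixing) coefficients are
\[
  \alpha(n) \;=\; \sup_{k \ge 1}\;
  \sup_{\substack{A \in \sigma(X_1,\dots,X_k) \\ B \in \sigma(X_{k+n},X_{k+n+1},\dots)}}
  \bigl| P(A \cap B) - P(A)\,P(B) \bigr|, \qquad n \ge 1,
\]
and the process is called α-mixing if $\alpha(n) \to 0$. A generic regularized empirical risk minimizer over a reproducing kernel Hilbert space $H$, with loss $L$ (for example least squares, or the pinball loss for quantile regression) and regularization parameter $\lambda > 0$, is
\[
  f_{D,\lambda} \;=\; \operatorname*{arg\,min}_{f \in H}\;
  \lambda \,\|f\|_H^2 \;+\; \frac{1}{n}\sum_{i=1}^{n} L\bigl(x_i, y_i, f(x_i)\bigr),
\]
where $D = ((x_1,y_1),\dots,(x_n,y_n))$ denotes the (dependent) observations; unregularized ERM corresponds to minimizing the empirical risk over a fixed hypothesis class instead.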
