Learning Theory Estimates with Observations from General Stationary Stochastic Processes

This letter investigates the supervised learning problem with observations drawn from certain general stationary stochastic processes. Here, general means that a broad class of stationary stochastic processes is covered. We show that when the stochastic processes satisfy a generalized Bernstein-type inequality, a unified treatment of learning schemes with various mixing processes becomes possible, and a sharp oracle inequality for generic regularized empirical risk minimization schemes can be established. The oracle inequality is then applied to derive convergence rates for several learning schemes, including empirical risk minimization (ERM), least squares support vector machines (LS-SVMs) with given generic kernels, and SVMs with Gaussian kernels for both least squares and quantile regression. For independent and identically distributed (i.i.d.) processes, our learning rates for ERM recover the known optimal rates. For non-i.i.d. processes, including geometrically α-mixing Markov processes, geometrically α-mixing processes with restricted decay, φ-mixing processes, and (time-reversed) geometrically C-mixing processes, our learning rates for SVMs with Gaussian kernels match the optimal rates up to an arbitrarily small extra term in the exponent. For the remaining cases, our rates are at least close to optimal. As a by-product, the assumed generalized Bernstein-type inequality also provides an interpretation of the so-called effective number of observations for various mixing processes.
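For orientation, a generalized Bernstein-type inequality of the kind assumed in the letter can be sketched as follows; the display is illustrative rather than a verbatim restatement, and the constants C, c_σ, c_B and the exponent α are generic placeholders for the quantities fixed in the letter's assumption. For a stationary process (Z_i) and a bounded function h with ‖h‖_∞ ≤ B and Eh² ≤ σ², one posits

\[
P\Bigl( \Bigl| \frac{1}{n}\sum_{i=1}^{n} h(Z_i) - \mathbb{E}h \Bigr| \ge \varepsilon \Bigr)
\;\le\;
C \exp\Bigl( - \frac{n^{\alpha}\,\varepsilon^{2}}{c_{\sigma}\sigma^{2} + c_{B} B \varepsilon} \Bigr),
\qquad \varepsilon > 0.
\]

Taking α = 1 recovers the form of the classical Bernstein inequality for i.i.d. observations; for mixing processes one typically has α ∈ (0, 1], and n^α then plays the role of the effective number of observations mentioned in the abstract.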

[1] Small sample estimation of the variance of time-averages in climatic time series, 1998.

[2] V. Volkonskii, et al., Some Limit Theorems for Random Functions. II, 1959.

[3] László Györfi, et al., A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[4] Steven Kay, et al., Gaussian Random Processes, 1978.

[5] Mathukumalli Vidyasagar, et al., A Theory of Learning and Generalization, 1997.

[6] A. Tsybakov, et al., Optimal aggregation of classifiers in statistical learning, 2003.

[7] John Odenckantz, et al., Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, 2000, Technometrics.

[8] Pierre Alquier, et al., Fast rates in learning with dependent observations, 2012, 1202.4283.

[9] Nikolaos Limnios, et al., Mathematical Statistics and Stochastic Processes, 2012.

[10] Vladimir Vapnik, Statistical learning theory, 1998.

[11] R. C. Bradley, Basic properties of strong mixing conditions. A survey and some open questions, 2005, math/0511078.

[12] Ingo Steinwart, et al., Fast learning from α-mixing observations, 2014, J. Multivar. Anal.

[13] Luc Devroye, et al., Combinatorial methods in density estimation, 2001, Springer Series in Statistics.

[14] Paul-Marie Samson, et al., Concentration of measure inequalities for Markov chains and Φ-mixing processes, 2000.

[15] I. Ibragimov, et al., Some Limit Theorems for Stationary Processes, 1962.

[16] Andreas Christmann, et al., Fast Learning from Non-i.i.d. Observations, 2009, NIPS.

[17] J. A. K. Suykens, et al., Least Squares Support Vector Machines, 2002.

[18] Ingo Steinwart, et al., Optimal regression rates for SVMs using Gaussian kernels, 2013.

[19] Mathukumalli Vidyasagar, et al., A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems, 1997.

[20] E. Rio, et al., Bernstein inequality and moderate deviations under strong mixing conditions, 2012, 1202.4777.

[21] Y. Davydov, Convergence of Distributions Generated by Stationary Stochastic Processes, 1968.

[22] J. M. Hammersley, et al., The “Effective” Number of Independent Observations in an Autocorrelated Time Series, 1946.

[23] Andreas Christmann, et al., Support vector machines, 2008, Data Mining and Knowledge Discovery Handbook.

[24] Di-Rong Chen, et al., Learning rates of regularized regression for exponentially strongly mixing sequence, 2008.

[25] Adam Krzyzak, et al., A Distribution-Free Theory of Nonparametric Regression, 2002, Springer Series in Statistics.

[26] D. Bosq, Bernstein-type large deviations inequalities for partial sums of strong mixing processes, 1993.

[27] Ingo Steinwart, et al., Estimating conditional quantiles with the help of the pinball loss, 2011, 1102.2101.

[28] M. Rosenblatt, A Central Limit Theorem and a Strong Mixing Condition, 1956, Proceedings of the National Academy of Sciences of the United States of America.

[29] Gilles Blanchard, et al., On the Rate of Convergence of Regularized Boosting Classifiers, 2003, J. Mach. Learn. Res.

[30] Dharmendra S. Modha, et al., Minimum complexity regression estimation with weakly dependent observations, 1996, IEEE Trans. Inf. Theory.

[31] Ingo Steinwart, Some oracle inequalities for regularized boosting classifiers, 2009.

[32] A. Zięba, Effective number of observations and unbiased estimators of variance for autocorrelated data - an overview, 2010.

[33] Yunlong Feng, et al., Least-squares regularized regression with dependent samples and q-penalty, 2012.

[34] Tong Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, 2003.

[35] A. Yaglom, Correlation Theory of Stationary and Related Random Functions I: Basic Results, 1987.

[36] N. Wermuth, et al., Nonlinear Time Series: Nonparametric and Parametric Methods, 2005.

[37] Qiang Wu, et al., A note on application of integral operator in learning theory, 2009.

[38] A. Nobel, Limits to classification and regression estimation from ergodic processes, 1999.

[39] Ingo Steinwart, et al., A Bernstein-type Inequality for Some Mixing Processes and Dynamical Systems with an Application to Learning, 2015, 1501.03059.

[40] A. E. Bostwick, The Theory of Probabilities, 1896, Science.

[41] R. Adamczak, A tail inequality for suprema of unbounded empirical processes with applications to Markov chains, 2007, 0709.3110.

[42] David Lubman, Spatial Averaging in a Diffuse Sound Field, 1969.

[43] Richard C. Bradley, Introduction to strong mixing conditions, 2007.

[44] I. Steinwart, et al., Learning from Dependent Observations, 2006.

[45] S. Mendelson, et al., Regularization in kernel learning, 2010, 1001.2094.

[46] M. Mohri, et al., Stability Bounds for Stationary φ-mixing and β-mixing Processes, 2010.

[47] Don R. Hush, et al., An Explicit Description of the Reproducing Kernel Hilbert Spaces of Gaussian RBF Kernels, 2006, IEEE Transactions on Information Theory.

[48] V. Maume-Deschamps, Exponential inequalities and functional estimations for weak dependent data: applications to dynamical systems, 2006, math/0604214.

[49] Don R. Hush, et al., Optimal Rates for Regularized Least Squares Regression, 2009, COLT.

[50] Hongwei Sun, et al., Regularized least square regression with dependent samples, 2010, Adv. Comput. Math.

[51] Bin Yu, Rates of Convergence for Empirical Processes of Stationary Mixing Sequences, 1994.

[52] Michael I. Jordan, et al., Ergodic mirror descent, 2011, 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[53] Andreas Christmann, et al., How SVMs can estimate quantiles and the median, 2007, NIPS.