PAC-Bayesian Collective Stability

Recent results have shown that the generalization error of structured predictors decreases with both the number of examples and the size of each example, provided the data distribution has weak dependence and the predictor exhibits a smoothness property called collective stability. These results use an especially strong definition of collective stability that must hold uniformly over all inputs and all hypotheses in the class. We investigate whether weaker definitions of collective stability suffice. Using the PAC-Bayes framework, which is particularly amenable to our new definitions, we prove that generalization is indeed possible when uniform collective stability holds with high probability over draws of predictors (and inputs). We then derive a generalization bound for a class of structured predictors with variably convex inference, which suggests a novel learning objective that optimizes collective stability.
