Structure Regularization for Structured Prediction: Theories and Experiments

While weight regularization has been studied extensively, structure regularization has received little attention. Many existing structured prediction systems focus on increasing the level of structural dependencies within the model. However, this trend may be misdirected: our study suggests that complex structures are actually harmful to generalization ability in structured prediction. To control structure-based overfitting, we propose a structure regularization framework based on \emph{structure decomposition}, which decomposes training samples into mini-samples with simpler structures, yielding a model with better generalization power. We show both theoretically and empirically that structure regularization can effectively control overfitting risk and lead to better accuracy. As a by-product, the proposed method also substantially accelerates training. The method and the theoretical results apply to general graphical models with arbitrary structures. Experiments on well-known, highly competitive tasks demonstrate that our method outperforms the benchmark systems, achieving state-of-the-art accuracy with substantially faster training.
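To make the idea of structure decomposition concrete, here is a minimal sketch for the special case of sequence labeling. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how mini-samples are chosen, so the sketch assumes contiguous chunks with random split points. The function name `decompose` and the parameter `alpha` (the number of mini-samples per training sample) are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

MiniSample = Tuple[Sequence[str], Sequence[str]]


def decompose(tokens: Sequence[str], tags: Sequence[str],
              alpha: int, rng: random.Random) -> List[MiniSample]:
    """Split one labeled sequence into up to `alpha` contiguous mini-samples.

    Assumption: split points are drawn uniformly at random without
    replacement; the source abstract does not specify this choice.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    n = len(tokens)
    if n == 0:
        return []
    # A sequence of length n can be cut into at most n non-empty chunks.
    k = max(1, min(alpha, n))
    # Choose k - 1 distinct interior cut positions, then sort them.
    cuts = sorted(rng.sample(range(1, n), k - 1))
    bounds = [0] + cuts + [n]
    return [(tokens[a:b], tags[a:b]) for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    rng = random.Random(0)
    words = ["Structure", "regularization", "controls", "overfitting", "risk"]
    labels = ["B", "I", "O", "B", "I"]
    for mini_tokens, mini_tags in decompose(words, labels, alpha=3, rng=rng):
        print(list(mini_tokens), list(mini_tags))
```

Each mini-sample would then be treated as an independent training instance, so dependencies that cross a split point are dropped during training. With `alpha=1` the sample is returned unchanged, which corresponds to training without decomposition.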
