Feature Noising for Log-Linear Structured Prediction

NLP models have many sparse features, and regularization is key to balancing overfitting against underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and named entity recognition (NER), our method provides a more than 1% absolute performance gain over standard L2 regularization.
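To make the second-order reinterpretation concrete for the unstructured case, below is a minimal numpy sketch (not the authors' code; the names W, x, and delta are illustrative assumptions) of the quadratic approximation to the noising penalty for multinomial logistic regression with unbiased dropout noise, i.e. R(theta, x) = 1/2 tr(Hessian of the log-partition at the clean scores times the covariance of the noised scores). The linear-chain CRF case additionally requires the dynamic program described in the paper and is not shown here.

```python
# Minimal sketch (assumed names, not the paper's code): second-order
# approximation of the feature-noising regularizer for multinomial
# logistic regression under unbiased dropout noise.
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def noising_regularizer(W, x, delta=0.5):
    """Quadratic approximation of the expected loss in log-likelihood
    caused by noising the feature vector x.

    W     : (K, D) weight matrix, one row of weights per class
    x     : (D,)   feature vector for a single example
    delta : dropout probability; each feature is zeroed w.p. delta and
            scaled by 1/(1-delta) otherwise, so noised scores are unbiased
    """
    s = W @ x                                   # clean class scores
    p = softmax(s)                              # p_theta(y | x)
    # Per-feature variance under unbiased dropout noise
    feat_var = (delta / (1.0 - delta)) * x ** 2            # (D,)
    # Covariance of the noised score vector: Cov = W diag(feat_var) W^T
    cov = (W * feat_var) @ W.T                              # (K, K)
    # Hessian of the log-partition function is diag(p) - p p^T, so
    # R = 1/2 * tr((diag(p) - p p^T) @ Cov)
    return 0.5 * (np.sum(p * np.diag(cov)) - p @ cov @ p)

# Usage sketch: the noised training objective for a labeled example (x, y)
# is approximately  log p(y | x; W) - noising_regularizer(W, x),
# maximized over W without generating any fake noised examples.
```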
