Improving Recurrent Neural Networks with Predictive Propagation for Sequence Labelling

Recurrent neural networks (RNNs) are a useful tool for sequence labelling tasks in natural language processing. Although in practice RNNs suffer from the vanishing/exploding gradient problem, their compactness still offers efficiency and makes them less prone to overfitting. In this paper we show that by propagating the predictions of previous labels we can improve the performance of RNNs while keeping the number of parameters unchanged and adding only one extra step at inference time. As a result, the models remain more compact and efficient than models with complex memory gates. In our experiments, we evaluate the idea on optical character recognition and chunking, achieving promising results.
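Below is a minimal sketch of one plausible reading of the label-propagation idea: the previous step's predicted label distribution is fed back alongside the current token features when tagging a sequence. The class name, dimensions, and the concatenation-based feedback are assumptions made for illustration only; they are not the paper's exact parameter-preserving construction.

```python
import torch
import torch.nn as nn

class LabelFeedbackRNN(nn.Module):
    """Hypothetical RNN tagger that feeds the previous step's predicted
    label distribution back in as extra input (an illustration of the
    label-propagation idea, not the authors' exact formulation)."""
    def __init__(self, input_dim, hidden_dim, num_labels):
        super().__init__()
        self.num_labels = num_labels
        # the recurrent cell sees the token features plus the previous
        # label prediction, so its input size is input_dim + num_labels
        self.cell = nn.RNNCell(input_dim + num_labels, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):            # x: (seq_len, batch, input_dim)
        seq_len, batch, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        prev_label = x.new_zeros(batch, self.num_labels)
        logits = []
        for t in range(seq_len):
            inp = torch.cat([x[t], prev_label], dim=-1)
            h = self.cell(inp, h)
            step_logits = self.out(h)
            logits.append(step_logits)
            # propagate the predicted label distribution to the next step
            prev_label = torch.softmax(step_logits, dim=-1).detach()
        return torch.stack(logits)   # (seq_len, batch, num_labels)

# usage: tag a toy batch of 5-step sequences (10-dim features, 4 labels)
model = LabelFeedbackRNN(input_dim=10, hidden_dim=32, num_labels=4)
scores = model(torch.randn(5, 2, 10))
print(scores.shape)                  # torch.Size([5, 2, 4])
```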
