Effect of Non-linear Deep Architecture in Sequence Labeling

If we compare widely used Conditional Random Fields (CRFs) with the newly proposed “deep architecture” sequence models (Collobert et al., 2011), two things change: the architecture goes from linear to non-linear, and the feature representation from discrete to distributional. It is unclear, however, what utility non-linearity offers in conventional feature-based models. In this study, we show the close connection between CRFs and “sequence model” neural nets, and present an empirical investigation comparing their performance on two sequence labeling tasks: Named Entity Recognition and Syntactic Chunking. Our results suggest that non-linear models are highly effective in low-dimensional distributional spaces. Somewhat surprisingly, we find that a non-linear architecture offers no benefits in a high-dimensional discrete feature space.
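
To make the “close connection” concrete, the following is a minimal sketch of the two scoring functions being compared; the notation is illustrative rather than the paper's own, with $f$ denoting the clique feature vector, $\theta$ the output weights, $W$ an assumed hidden-layer matrix, and $\tanh$ an assumed non-linearity. A linear-chain CRF scores each clique linearly in its features,

$$p(y \mid x) \propto \exp\Big( \textstyle\sum_{t} \theta^{\top} f(y_{t-1}, y_t, x, t) \Big),$$

while the non-linear variant passes the same features through a hidden layer before scoring,

$$p(y \mid x) \propto \exp\Big( \textstyle\sum_{t} \theta^{\top} \tanh\big( W f(y_{t-1}, y_t, x, t) \big) \Big).$$

If the hidden layer is replaced by the identity map, the second model reduces to the first, which is the sense in which CRFs and “sequence model” neural nets are closely connected.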

[1] Jorge Nocedal et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.

[2] Gilles Pagès et al. Approximations of Functions by a Multilayer Perceptron: a New Approach, 1997, Neural Networks.

[3] Samy Bengio et al. Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks, 1999, NIPS.

[4] Sabine Buchholz et al. Introduction to the CoNLL-2000 Shared Task: Chunking, 2000, CoNLL/LLL.

[5] Thorsten Joachims et al. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, 2002, The Kluwer International Series in Engineering and Computer Science.

[6] Erik F. Tjong Kim Sang et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[7] Michel Verleysen et al. On the Effects of Dimensionality on Data Analysis with Neural Networks, 2009, IWANN.

[8] Francesco Camastra et al. Data dimensionality estimation methods: a survey, 2003, Pattern Recognit.

[9] Christopher D. Manning et al. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling, 2005, ACL.

[10] Tong Zhang et al. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, 2005, J. Mach. Learn. Res.

[11] Yoshua Bengio et al. Learning Deep Architectures for AI, 2007, Found. Trends Mach. Learn.

[12] Dan Klein et al. Structure compilation: trading structure for features, 2008, ICML '08.

[13] Jason Weston et al. A unified architecture for natural language processing: deep neural networks with multitask learning, 2008, ICML '08.

[14] Jian Peng et al. Conditional Neural Fields, 2009, NIPS.

[15] Eric Fosler-Lussier et al. Backpropagation training for multilayer conditional random field based phone recognition, 2010, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Thierry Artières et al. Neural conditional random fields, 2010, AISTATS.

[17] Yoshua Bengio et al. Word Representations: A Simple and General Method for Semi-Supervised Learning, 2010, ACL.

[18] Jason Weston et al. Natural Language Processing (Almost) from Scratch, 2011, J. Mach. Learn. Res.

[19] Jeffrey Pennington et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions, 2011, EMNLP.

[20] Xaq Pitkow et al. Compressive neural representation of sparse, high-dimensional probabilities, 2012, NIPS.