Bridging CNNs, RNNs, and Weighted Finite-State Machines

Recurrent and convolutional neural networks are two distinct families of models that have proven useful for encoding natural language utterances. In this paper we present SoPa, a new model that aims to bridge these two approaches. SoPa combines neural representation learning with weighted finite-state automata (WFSAs) to learn a soft version of traditional surface patterns. We show that SoPa is an extension of a one-layer CNN, and that such CNNs are equivalent to a restricted version of SoPa and, accordingly, to a restricted form of WFSA. Empirically, on three text classification tasks, SoPa is comparable to or better than both a BiLSTM (RNN) baseline and a CNN baseline, and is particularly useful in small-data settings.
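
The claimed correspondence between a one-layer CNN and a restricted WFSA can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the WFSA is a linear chain whose only transitions advance one state per token, each transition weight is a dot product between a learned vector and the token embedding (no bias), scores are combined in the max-sum semiring, and a match may start and end at any token. The function names and array shapes are hypothetical.

```python
import numpy as np

def cnn_max_pool_score(X, W):
    """One linear CNN filter of width d, max-pooled over all positions.
    X: (n, e) token embeddings; W: (d, e) filter weights. Assumes n >= d."""
    n, d = X.shape[0], W.shape[0]
    window_scores = [
        sum(W[j] @ X[i + j] for j in range(d))  # filter applied to tokens i..i+d-1
        for i in range(n - d + 1)
    ]
    return max(window_scores)

def wfsa_max_sum_score(X, W):
    """Linear-chain WFSA with d+1 states and main-path transitions only,
    scored in the max-sum semiring (a Viterbi-style recurrence).
    The transition from state j to j+1 on token x has weight W[j] @ x."""
    d = W.shape[0]
    h = np.full(d + 1, -np.inf)   # h[j]: best score of a partial match now in state j
    best = -np.inf
    for x in X:
        h[0] = 0.0                # a new match may start at any token
        step = W @ x              # weights of the d main-path transitions on x
        new_h = np.full(d + 1, -np.inf)
        new_h[1:] = h[:-1] + step # advance every partial match by one state
        h = new_h
        best = max(best, h[d])    # a match may end at any token
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(7, 4))   # 7 tokens, 4-dimensional embeddings
    W = rng.normal(size=(3, 4))   # pattern / filter of width 3
    print(cnn_max_pool_score(X, W), wfsa_max_sum_score(X, W))
```

Under these assumptions the two functions return the same value, because every complete path through the chain consumes exactly d consecutive tokens. Allowing additional transition types in the automaton, so that matched spans can vary in length, would take the model outside what a fixed-width filter can express.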
