Natural Language Understanding with Distributed Representation

This is a lecture note for the course DS-GA 3001 at the Center for Data Science, New York University, in Fall 2015. As the name of the course suggests, this lecture note introduces readers to a neural-network-based approach to natural language understanding and processing. To make it as self-contained as possible, I spend much time describing the basics of machine learning and neural networks, and only then introduce how they are used for natural language. On the language front, I focus almost solely on language modelling and machine translation, the two tasks I personally find most fascinating and most fundamental to natural language understanding.
