Symbolic, Distributed, and Distributional Representations for Natural Language Processing in the Era of Deep Learning: A Survey

Natural language is inherently a discrete symbolic representation of human knowledge. Recent advances in machine learning (ML) and natural language processing (NLP) seem to contradict this intuition: discrete symbols are fading away, replaced by vectors and tensors called distributed and distributional representations. However, there is a strict link between distributed/distributional representations and discrete symbols, the former being an approximation of the latter. A clearer understanding of this link may lead to radically new deep learning networks. In this paper we present a survey that aims to renew the connection between symbolic representations and distributed/distributional representations. This is the right time to revitalize the area of interpreting how discrete symbols are represented inside neural networks.
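The sense in which a distributed representation approximates discrete symbols can be illustrated with a minimal sketch, not taken from the survey itself: each symbol is assigned a random high-dimensional vector, and because such vectors are nearly orthogonal (by the Johnson-Lindenstrauss lemma), a sum of symbol vectors still supports approximate recovery of which symbols it contains. The vocabulary, dimensionality, and decoding threshold below are illustrative choices, not values prescribed by any particular method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 300                      # dimensionality of the distributed space
vocab = ["dog", "cat", "runs", "sleeps"]

# Assign each discrete symbol a random Gaussian vector, scaled so that
# E[v . v] = 1. In high dimensions these vectors are nearly orthogonal,
# so distinct symbols remain approximately distinguishable.
embedding = {w: rng.standard_normal(d) / np.sqrt(d) for w in vocab}

# A bag of symbols becomes the sum of their vectors: a distributed
# (and lossy) representation of a discrete structure.
sentence = embedding["dog"] + embedding["runs"]

# Approximate symbolic decoding: the dot product with a symbol vector is
# close to 1 if the symbol is present and close to 0 otherwise.
scores = {w: float(sentence @ v) for w, v in embedding.items()}
present = {w for w, s in scores.items() if s > 0.5}
print(present)
```

Decoding here is probabilistic rather than exact: the cross-talk between random vectors shrinks as the dimensionality grows, which is precisely why the distributed representation behaves as an approximation of the underlying discrete symbols rather than a faithful copy.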
