Gating Mechanisms for Combining Character and Word-level Word Representations: An Empirical Study

In this paper we study how different ways of combining character-level and word-level representations affect the quality of the final word and sentence representations. We provide strong empirical evidence that modeling characters improves the learned representations at both the word and sentence levels, and that doing so is particularly useful for representing less frequent words. We further show that a feature-wise sigmoid gating mechanism is a robust method for creating representations that encode semantic similarity, as it performed reasonably well on several word similarity datasets. Finally, our findings suggest that properly capturing semantic similarity at the word level does not consistently yield improved performance in downstream sentence-level tasks. Our code is available at this https URL
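To make the idea of a feature-wise sigmoid gate concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the word-level and character-level vectors have the same dimensionality, that the gate is computed from the word-level vector alone, and that the gate weights the character-level vector while its complement weights the word-level vector. The abstract confirms none of these details. The names `feature_wise_gate`, `weights`, and `bias` are hypothetical, and the random values stand in for learned parameters.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def feature_wise_gate(word_vec, char_vec, weights, bias):
    """Combine two same-sized vectors using one sigmoid gate per feature.

    Assumption (not taken from the abstract): the gate is computed from
    the word-level vector only, g = sigmoid(W @ word_vec + b), and the
    output is g * char_vec + (1 - g) * word_vec, element by element.
    """
    dim = len(word_vec)
    combined = []
    for i in range(dim):
        # Row i of the weight matrix produces the gate for feature i.
        pre_activation = bias[i] + sum(
            weights[i][j] * word_vec[j] for j in range(dim)
        )
        g = sigmoid(pre_activation)
        # Each feature is its own convex mix of the two inputs.
        combined.append(g * char_vec[i] + (1.0 - g) * word_vec[i])
    return combined


# Toy inputs: random stand-ins for learned embeddings and gate parameters.
rng = random.Random(0)
dim = 4
word_vec = [rng.uniform(-1, 1) for _ in range(dim)]
char_vec = [rng.uniform(-1, 1) for _ in range(dim)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
bias = [0.0] * dim

combined = feature_wise_gate(word_vec, char_vec, weights, bias)
print(combined)
```

Because each feature has its own gate value, the model can rely on character information for some dimensions and on word information for others, which a single scalar gate cannot do.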
