Towards robust word embeddings for noisy texts

Research on word embeddings has mainly focused on improving their performance on standard corpora, disregarding the difficulties posed by noisy texts such as tweets and other non-standard writing from social media. In this work, we propose a simple extension to the skipgram model in which we introduce the concept of bridge-words: artificial words added to the model to strengthen the similarity between standard words and their noisy variants. Our new embeddings outperform baseline models on noisy texts across a wide range of evaluation tasks, both intrinsic and extrinsic, while retaining good performance on standard texts. To the best of our knowledge, this is the first explicit approach to dealing with this type of noisy text at the word-embedding level that goes beyond support for out-of-vocabulary words.
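The abstract does not spell out how bridge-words are generated or where they are inserted, so the following is only a minimal sketch of the general idea under stated assumptions: synthetic noisy variants of standard words (produced here by a toy vowel-deletion rule, which is an assumption, not the paper's rule set) are injected into the training corpus so that skipgram observes the standard and noisy forms in shared contexts and pulls their vectors together. The helper names and the perturbation are hypothetical; the paper's actual generation rules and model changes may differ.

```python
# Sketch: corpus augmentation with synthetic "bridge" tokens for skipgram.
# Assumptions (not confirmed by the abstract): bridge-words are noisy
# variants made by character-level perturbations, injected as duplicate
# sentences so standard and noisy forms share skipgram contexts.
import random
from gensim.models import Word2Vec  # gensim 4.x API

random.seed(0)

def noisy_variant(word: str) -> str:
    """Toy perturbation: drop internal vowels, e.g. 'tomorrow' -> 'tmrrw'."""
    if len(word) < 4:
        return word
    body = "".join(c for c in word[1:-1] if c not in "aeiou")
    return word[0] + body + word[-1]

def augment(sentences, p=0.3):
    """For each sentence, also emit a copy where some words are replaced by
    their noisy variants, so both forms appear in the same contexts."""
    out = []
    for sent in sentences:
        out.append(sent)
        out.append([noisy_variant(w) if random.random() < p else w
                    for w in sent])
    return out

corpus = [["see", "you", "tomorrow", "morning"],
          ["meet", "me", "tomorrow", "night"]] * 50  # tiny toy corpus

model = Word2Vec(augment(corpus), vector_size=50, window=3,
                 min_count=1, sg=1, epochs=20, seed=0)  # sg=1 -> skipgram
print(model.wv.similarity("tomorrow", noisy_variant("tomorrow")))
```

Given enough shared contexts, the similarity between "tomorrow" and its synthetic variant "tmrrw" rises well above that of unrelated words, which is the bridging effect the abstract describes, here reproduced only in toy form.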
