Tweet2Vec: Character-Based Distributed Representations for Social Media

Text from social media poses a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, performing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
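To make the character composition idea concrete, the following is a minimal sketch of such an encoder in PyTorch. It assumes a bidirectional GRU running over character embeddings, trained against a hashtag-prediction objective; the class name `Tweet2VecEncoder`, the layer sizes, and the way the forward and backward states are combined are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a character-level tweet encoder (hedged: layer
# choices and hyperparameters are illustrative, not the paper's exact setup).
import torch
import torch.nn as nn

class Tweet2VecEncoder(nn.Module):
    def __init__(self, vocab_size, char_dim=128, hidden_dim=256, num_hashtags=2000):
        super().__init__()
        # Each character in the tweet gets its own learned embedding.
        self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        # A bidirectional GRU reads the tweet one character at a time,
        # capturing non-local dependencies in both directions.
        self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Combine the final forward and backward states into one tweet vector.
        self.combine = nn.Linear(2 * hidden_dim, hidden_dim)
        # Training signal: predict the hashtag(s) attached to the tweet.
        self.classifier = nn.Linear(hidden_dim, num_hashtags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        x = self.embed(char_ids)
        _, h_n = self.gru(x)                     # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        tweet_vec = self.combine(h)              # the tweet representation
        return self.classifier(tweet_vec), tweet_vec

# Usage: encode a batch of 4 tweets padded to 140 characters.
encoder = Tweet2VecEncoder(vocab_size=100)
logits, vec = encoder(torch.randint(1, 100, (4, 140)))
```

Because the vocabulary is a fixed character set rather than an open word list, out-of-vocabulary words and unusual character sequences still map to valid inputs, which is the property the abstract credits for the model's advantage over the word-level baseline.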
