Tweet2Vec: Character-Based Distributed Representations for Social Media

Text from social media poses a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, performing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
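To make the character composition idea concrete, the following is a minimal sketch of such an encoder in PyTorch. It assumes a bidirectional GRU running over character embeddings, trained against a hashtag-prediction objective; the class name `Tweet2VecEncoder`, the layer sizes, and the way the forward and backward states are combined are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a character-level tweet encoder (hedged: layer
# choices and hyperparameters are illustrative, not the paper's exact setup).
import torch
import torch.nn as nn

class Tweet2VecEncoder(nn.Module):
    def __init__(self, vocab_size, char_dim=128, hidden_dim=256, num_hashtags=2000):
        super().__init__()
        # Each character in the tweet gets its own learned embedding.
        self.embed = nn.Embedding(vocab_size, char_dim, padding_idx=0)
        # A bidirectional GRU reads the tweet one character at a time,
        # capturing non-local dependencies in both directions.
        self.gru = nn.GRU(char_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Combine the final forward and backward states into one tweet vector.
        self.combine = nn.Linear(2 * hidden_dim, hidden_dim)
        # Training signal: predict the hashtag(s) attached to the tweet.
        self.classifier = nn.Linear(hidden_dim, num_hashtags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        x = self.embed(char_ids)
        _, h_n = self.gru(x)                     # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2 * hidden_dim)
        tweet_vec = self.combine(h)              # the tweet representation
        return self.classifier(tweet_vec), tweet_vec

# Usage: encode a batch of 4 tweets padded to 140 characters.
encoder = Tweet2VecEncoder(vocab_size=100)
logits, vec = encoder(torch.randint(1, 100, (4, 140)))
```

Because the vocabulary is a fixed character set rather than an open word list, out-of-vocabulary words and unusual character sequences still map to valid inputs, which is the property the abstract credits for the model's advantage over the word-level baseline.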
