Lexical Comparison Between Wikipedia and Twitter Corpora by Using Word Embeddings

Compared with carefully edited prose, the language of social media is informal in the extreme. Applying NLP techniques in this context may require a better understanding of how words are used in social media. In this paper, we compute a word embedding for a corpus of tweets and compare it with a word embedding for Wikipedia. After learning a transformation from one vector space to the other, and adjusting similarity values according to term frequency, we identify words whose usage differs greatly between the two corpora. For any given word, the set of words closest to it in a particular embedding characterizes that word’s usage within the corresponding corpus.
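
The comparison step can be illustrated with a minimal sketch. It assumes the transformation is a linear map fitted by least squares on a set of anchor words, and that usage difference is scored by cosine similarity between a word's mapped vector and its vector in the other space. Both choices are assumptions made for illustration, since the abstract does not specify them. The term-frequency adjustment is omitted, and the words, vectors, and function names (`learn_linear_map`, `usage_shift`) are hypothetical toy values rather than anything taken from the paper.

```python
import numpy as np


def learn_linear_map(src, tgt):
    """Return W minimizing ||src @ W - tgt|| in the least-squares sense."""
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def cosine_rows(a, b):
    """Row-wise cosine similarity between two equally shaped matrices."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den


def usage_shift(words, src, tgt, anchors):
    """Rank words from most to least different across the two spaces.

    The map is fitted on the anchor rows only, then applied to every word.
    """
    W = learn_linear_map(src[anchors], tgt[anchors])
    sims = cosine_rows(src @ W, tgt)
    order = np.argsort(sims)  # lowest similarity first
    return [(words[i], float(sims[i])) for i in order]


# Toy data (hypothetical): six words with 3-dimensional vectors.
rng = np.random.default_rng(0)
words = ["the", "river", "music", "paper", "city", "sick"]
src = rng.normal(size=(6, 3))  # stands in for the tweet embedding

# Build the second space as a rotation of the first, so that most words
# keep the same relative position in both spaces.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
tgt = src @ rotation  # stands in for the Wikipedia embedding

# Simulate one word whose usage differs: flip its vector in the second space.
tgt[5] = -tgt[5]

# Fit the map on the other five words and rank all six.
ranked = usage_shift(words, src, tgt, anchors=[0, 1, 2, 3, 4])
for word, sim in ranked:
    print(f"{word:>6s}  {sim:+.3f}")
```

In this toy setup the flipped word is ranked first because its mapped vector points away from its counterpart, while the anchor words map almost exactly onto theirs. With real embeddings the fit would not be exact, and low-frequency words would tend to have noisy vectors, which is presumably why the similarity values are adjusted by term frequency.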
