Exploiting Topical Perceptions over Multi-Lingual Text for Hashtag Suggestion on Twitter

Microblogging websites, such as Twitter, provide a seemingly endless amount of textual information on a wide variety of topics, generated by a large number of users. Microblog posts, or tweets on Twitter, are often written informally and in multi-lingual styles. Ignoring informal styles or multiple languages can hamper the usefulness of microblog mining applications. In this paper, we present a statistical method for processing tweets according to users' perceptions of topics and hashtags. Based on the non-classical notion of relatedness of vocabulary terms to topics in a corpus, which is quantified by discriminative term weights, our method builds a ranked list of terms related to each hashtag. Subsequently, given a new tweet, our method can suggest a ranked list of hashtags. Our method allows enhanced understanding and normalization of users' perceptions for improved information retrieval applications. We evaluate our method on a dataset of 14 million tweets collected over a period of 52 days. Results demonstrate that the method learns useful relationships between vocabulary terms and topics, and that it performs better than a Naive Bayes suggestion system.
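
To make the pipeline above concrete, the following is a minimal sketch of hashtag suggestion driven by discriminative term weights. It is an illustration under stated assumptions, not the exact formulation evaluated in this paper: the weight used here is a smoothed log ratio of a term's document frequency inside versus outside a hashtag, the tokenizer is plain whitespace splitting, and the names `tokenize`, `train`, and `suggest` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; hashtags are dropped
    # from the term vocabulary so they cannot trivially predict themselves.
    return {tok for tok in text.lower().split() if not tok.startswith("#")}


def train(tweets):
    """Build, per hashtag, a map of related terms to discriminative weights.

    tweets: iterable of (text, list_of_hashtags) pairs.
    Assumed weight: log of the add-one-smoothed ratio between the share of
    tweets with the hashtag that contain the term and the share of tweets
    without the hashtag that contain it. Only ratios above 1 are kept.
    """
    tweets = list(tweets)
    n_docs = len(tweets)
    tag_docs = Counter()                 # tweets per hashtag
    term_docs = Counter()                # tweets per term
    term_tag_docs = defaultdict(Counter)  # hashtag -> term -> tweets

    for text, tags in tweets:
        terms = tokenize(text)
        term_docs.update(terms)
        for tag in set(tags):
            tag_docs[tag] += 1
            term_tag_docs[tag].update(terms)

    weights = {}
    for tag, counts in term_tag_docs.items():
        n_in = tag_docs[tag]
        n_out = n_docs - n_in
        related = {}
        for term, c_in in counts.items():
            c_out = term_docs[term] - c_in
            p_in = (c_in + 1) / (n_in + 2)
            p_out = (c_out + 1) / (n_out + 2)
            ratio = p_in / p_out
            if ratio > 1:
                related[term] = math.log(ratio)
        weights[tag] = related
    return weights


def related_terms(weights, tag, k=5):
    # Ranked list of the terms most related to one hashtag.
    items = weights.get(tag, {}).items()
    return [t for t, _ in sorted(items, key=lambda kv: (-kv[1], kv[0]))[:k]]


def suggest(weights, text, k=3):
    # Score each hashtag by summing the weights of the tweet's terms,
    # then return the top-k hashtags with a positive score.
    terms = tokenize(text)
    scored = []
    for tag, related in weights.items():
        score = sum(related.get(term, 0.0) for term in terms)
        if score > 0:
            scored.append((score, tag))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tag for _, tag in scored[:k]]
```

Given a small labeled sample, `train` returns the per-hashtag term weights, `related_terms` gives the ranked term list for a hashtag, and `suggest` ranks hashtags for an unseen tweet. Because the weights depend only on term co-occurrence counts, the sketch makes no language-specific assumptions beyond the tokenizer.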