On Analyzing Hashtags in Twitter

Hashtags, originally introduced on Twitter, are becoming the most widely used way to tag short messages in social networks, because they facilitate subsequent search, classification and clustering over those messages. However, extracting information from hashtags is difficult: their composition is not constrained by any (linguistic) rule, and they usually appear in short, poorly written messages that are hard to analyze with classic IR techniques. In this paper we address two challenging problems regarding the meaning of hashtags, namely hashtag relatedness and hashtag classification, and we provide two main contributions. First, we build a novel graph upon hashtags and (Wikipedia) entities drawn from the tweets by means of topic annotators (such as TagME); this graph allows us to model effectively not only classic co-occurrences but also semantic relatedness between hashtags and entities, and between entities themselves. Based on this graph, we design algorithms that significantly improve on state-of-the-art results over known publicly available datasets. The second contribution is the construction and public release to the research community of two new datasets: the first is a new dataset for hashtag relatedness, and the second is a dataset for hashtag classification that is up to two orders of magnitude larger than the existing ones. We use these datasets to show the robustness and efficacy of our approaches, with absolute improvements in F1 of up to double-digit percentage points.
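
To make the graph described above more concrete, the following is a minimal sketch of one possible hashtag-entity graph. It is not the construction used in the paper: the function name `build_hashtag_entity_graph`, the edge-weighting scheme (raw co-occurrence counts), the `min_rel` threshold, and the toy relatedness table are all assumptions made for illustration. The entity annotations are assumed to have been produced beforehand by a topic annotator such as TagME, and the entity relatedness function is a stand-in for a Wikipedia-based measure.

```python
from collections import defaultdict
from itertools import combinations


def build_hashtag_entity_graph(annotated_tweets, entity_relatedness, min_rel=0.0):
    """Build an undirected weighted graph over hashtags and entities.

    annotated_tweets: iterable of (hashtags, entities) pairs, where the
        entities are assumed to come from an external topic annotator.
    entity_relatedness: hypothetical function (entity, entity) -> float.
    min_rel: hypothetical threshold below which entity edges are dropped.
    """
    graph = defaultdict(dict)

    def add_weight(u, v, w):
        # Accumulate co-occurrence weight symmetrically.
        graph[u][v] = graph[u].get(v, 0.0) + w
        graph[v][u] = graph[v].get(u, 0.0) + w

    entities_seen = set()
    for hashtags, entities in annotated_tweets:
        hs = {("hashtag", h.lower()) for h in hashtags}
        es = {("entity", e) for e in entities}
        entities_seen.update(es)
        # Hashtag-entity edges: co-occurrence within the same tweet.
        for h in hs:
            for e in es:
                add_weight(h, e, 1.0)
        # Hashtag-hashtag edges: classic co-occurrence.
        for h1, h2 in combinations(sorted(hs), 2):
            add_weight(h1, h2, 1.0)

    # Entity-entity edges: semantic relatedness rather than co-occurrence.
    for e1, e2 in combinations(sorted(entities_seen), 2):
        rel = entity_relatedness(e1[1], e2[1])
        if rel > min_rel:
            graph[e1][e2] = rel
            graph[e2][e1] = rel
    return graph


# Toy input: hashtags and entities per tweet (made-up data).
tweets = [
    (["#WorldCup", "#Brazil"], ["FIFA World Cup", "Brazil"]),
    (["#worldcup"], ["FIFA World Cup", "Association football"]),
]

# Toy relatedness table standing in for a Wikipedia-based measure.
TOY_REL = {
    frozenset(["FIFA World Cup", "Association football"]): 0.8,
    frozenset(["FIFA World Cup", "Brazil"]): 0.4,
}


def toy_relatedness(a, b):
    return TOY_REL.get(frozenset([a, b]), 0.0)


graph = build_hashtag_entity_graph(tweets, toy_relatedness)
```

In this sketch, two hashtags that never co-occur can still be connected through entities they share or through related entities, which is the kind of signal a relatedness or classification algorithm could exploit on top of such a graph.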
