Learning similarity functions for topic detection in online reputation monitoring

Reputation management experts must constantly monitor Twitter, among other sources, and determine at any given time what is being said about the entity of interest (a company, organization, personality, etc.). Solving this reputation monitoring problem automatically as a topic detection task is both essential, because manual processing of the data is costly or outright prohibitive, and challenging, because topics of interest for reputation monitoring are usually fine-grained and suffer from data sparsity. We focus on a solution that (i) learns a pairwise tweet similarity function from previously annotated data, using a broad range of content-based and Twitter-based features, and (ii) applies a clustering algorithm to the learned similarity function. Our experiments indicate that (i) Twitter signals can improve topic detection with respect to using content signals alone, and (ii) learning a similarity function is a flexible and efficient way of introducing supervision into the clustering process. Our best system performs substantially better than state-of-the-art approaches and comes close to the inter-annotator agreement rate. A detailed qualitative inspection of the data further reveals two types of topics detected by reputation experts: reputation alerts or issues (which usually spike in time) and organizational topics (which are usually stable over time).
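The two-step pipeline described above (learn a pairwise similarity function, then cluster on it) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy tweets, the three features (word overlap as a content signal, shared hashtag and shared author as Twitter signals), the logistic-regression learner, and the thresholded single-link clustering are stand-ins, not the feature set, classifier, or clustering algorithm used in the paper.

```python
import math
from itertools import combinations

def features(a, b):
    """Illustrative pairwise features: bias, word Jaccard, shared hashtag, same author."""
    wa, wb = set(a["text"].lower().split()), set(b["text"].lower().split())
    jaccard = len(wa & wb) / (len(wa | wb) or 1)
    tags_a = {t for t in wa if t.startswith("#")}
    tags_b = {t for t in wb if t.startswith("#")}
    shared_tag = 1.0 if tags_a & tags_b else 0.0
    same_author = 1.0 if a["author"] == b["author"] else 0.0
    return [1.0, jaccard, shared_tag, same_author]

def sigmoid(z):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def train(tweets, epochs=200, lr=0.5):
    """Fit logistic regression on all tweet pairs; label 1 means 'same annotated topic'."""
    pairs = [(features(a, b), 1.0 if a["topic"] == b["topic"] else 0.0)
             for a, b in combinations(tweets, 2)]
    w = [0.0] * 4
    for _ in range(epochs):
        for x, y in pairs:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def similarity(w, a, b):
    """Learned similarity: estimated probability that two tweets share a topic."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(a, b))))

def cluster(tweets, w, threshold=0.5):
    """Single-link clustering: merge any pair whose learned similarity reaches the threshold."""
    parent = list(range(len(tweets)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(tweets)), 2):
        if similarity(w, tweets[i], tweets[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(tweets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical annotated tweets (the "previously annotated data").
annotated = [
    {"text": "bank fees rise again #fees", "author": "u1", "topic": "fees"},
    {"text": "angry about new bank fees #fees", "author": "u2", "topic": "fees"},
    {"text": "bank fees are too high #fees", "author": "u1", "topic": "fees"},
    {"text": "new ceo appointed today #ceo", "author": "u3", "topic": "ceo"},
    {"text": "ceo appointed amid changes #ceo", "author": "u4", "topic": "ceo"},
    {"text": "welcome the new ceo #ceo", "author": "u3", "topic": "ceo"},
]
# Hypothetical unlabeled tweets to be grouped into topics.
new_tweets = [
    {"text": "bank fees complaint #fees", "author": "u5"},
    {"text": "fees at the bank #fees", "author": "u6"},
    {"text": "ceo speech #ceo", "author": "u7"},
    {"text": "our ceo speaks #ceo", "author": "u8"},
]

w = train(annotated)
clusters = cluster(new_tweets, w)
print(clusters)
```

The point of the design is that supervision enters only through the similarity function: the clustering step never sees topic labels, so the same learned function can be reused with a different clustering algorithm or threshold.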
