Linking online news and social media

Much of what is discussed in social media is inspired by events in the news and, conversely, social media give us a handle on the impact of news events. We address the following linking task: given a news article, find social media utterances that implicitly reference it. We follow a three-step approach: we derive multiple query models from a given source news article, use them to retrieve utterances from a target social media index, and merge the resulting ranked lists using data fusion techniques. Query models are created by exploiting the structure of the source article and by using explicitly linked social media utterances that discuss it. To combat query drift resulting from the large volume of text, either in the source news article itself or in the social media utterances explicitly linked to it, we introduce a graph-based method for selecting discriminative terms. For our experimental evaluation, we use data from Twitter, Digg, Delicious, the New York Times Community, Wikipedia, and the blogosphere to generate query models. We show that query models based on different data sources provide complementary information and retrieve different social media utterances from our target index. As a consequence, data fusion methods significantly boost retrieval performance over individual approaches. Our graph-based term selection method is shown to improve both effectiveness and efficiency.
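The final step of the approach, merging the ranked lists produced by several query models, can be illustrated with a minimal sketch. The abstract does not say which fusion method is used, so this example assumes CombSUM over min-max normalized scores, one common data fusion technique. The function names, the utterance IDs, and the scores are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def min_max_normalize(run):
    """Scale the scores of one ranked list into [0, 1]."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        # All scores equal: treat every utterance as equally relevant.
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs):
    """Fuse ranked lists by summing normalized scores (CombSUM).

    Each run maps an utterance ID to a retrieval score. An utterance
    missing from a run contributes 0 for that run.
    """
    fused = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    # Sort by fused score (descending), then by ID for a stable order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from two query models for the same news article:
# one built from the article title, one from explicitly linked tweets.
title_run = {"u1": 12.0, "u2": 9.0, "u3": 6.0}
tweets_run = {"u2": 8.0, "u4": 6.0, "u1": 4.0}

fused = comb_sum([title_run, tweets_run])
ranking = [doc for doc, _ in fused]
print(ranking)
```

In this toy example, the utterance `u2` is ranked highly by both query models and therefore moves to the top of the fused list, while `u4`, which only the second query model retrieves, still enters the final ranking. This mirrors the claim that query models built from different sources contribute complementary results.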
