Summarizing a Document Stream

We introduce the task of summarizing a stream of short documents on microblogs such as Twitter. On microblogs, users post thousands of short documents on a given topic, such as a sports match or a TV drama. Two noticeable characteristics of microblog data are that documents are often highly redundant and that they are aligned on a timeline. A single event within a topic can attract thousands of documents, and two very similar documents can refer to two distinct events when they are temporally distant. We examine microblog data to better understand these characteristics, and propose a summarization model for a stream of short documents on a timeline, along with a fast approximate algorithm for generating a summary. We empirically show that our model generates good summaries on datasets of microblog documents about sports matches.
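The two characteristics above (heavy redundancy, and similar texts that mean different things when far apart in time) can be illustrated with a small sketch. This is not the model or algorithm proposed in the paper, whose details are not given here. It is a generic greedy selection under our own assumptions: the names `Post`, `time_aware_similarity`, and `greedy_summary`, the Jaccard word overlap, and the 300-second window are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    time: float  # seconds since the start of the stream (assumed unit)
    text: str


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def time_aware_similarity(a: Post, b: Post, window: float = 300.0) -> float:
    """Similarity that is zero when posts are further apart than `window`.

    Assumption: temporally distant posts describe different events,
    even if their wording is nearly identical.
    """
    if abs(a.time - b.time) > window:
        return 0.0
    return jaccard(a.text, b.text)


def greedy_summary(posts: list[Post], k: int, window: float = 300.0) -> list[Post]:
    """Greedily pick up to k posts that best 'cover' the whole stream.

    Coverage of a post is its highest similarity to any selected post;
    each round adds the candidate with the largest total coverage gain.
    """
    selected: list[int] = []
    covered = [0.0] * len(posts)
    for _ in range(min(k, len(posts))):
        best_gain, best_idx = 0.0, None
        for j, cand in enumerate(posts):
            if j in selected:
                continue
            gain = sum(
                max(time_aware_similarity(p, cand, window) - covered[i], 0.0)
                for i, p in enumerate(posts)
            )
            if gain > best_gain:
                best_gain, best_idx = gain, j
        if best_idx is None:  # nothing left improves coverage
            break
        selected.append(best_idx)
        for i, p in enumerate(posts):
            sim = time_aware_similarity(p, posts[best_idx], window)
            covered[i] = max(covered[i], sim)
    # Return the chosen posts in timeline order.
    return [posts[j] for j in sorted(selected)]


posts = [
    Post(10, "goal by tanaka"),
    Post(12, "goal by tanaka !!"),
    Post(15, "great goal by tanaka"),
    Post(3000, "goal by tanaka"),
    Post(3005, "goal by tanaka again"),
]
summary = greedy_summary(posts, k=2)
for post in summary:
    print(post.time, post.text)
```

In this toy stream the first and fourth posts have identical text but are almost 50 minutes apart, so the sketch treats them as reports of two separate goals and keeps one representative for each burst instead of collapsing them into one.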
