Experiments in Microblog Summarization

Abstract: This paper presents algorithms for summarizing microblog posts. In particular, our algorithms process collections of short posts on specific topics from the well-known site Twitter and create a short summary of each collection. The goal is to produce summaries similar to what a human would produce for the same collection of posts on a specific topic. We evaluate the summaries produced by the summarizing algorithms, compare them with human-produced summaries, and obtain excellent results.

I. INTRODUCTION

Twitter, the microblogging site started in 2006, has become a social phenomenon, with more than 20 million visitors each month. While the majority of posts are conversational or not very meaningful, about 3.6% of the posts concern topics of mainstream news¹. At the end of 2009, Twitter had 75 million account holders, of whom about 20% were active². There are approximately 2.5 million Twitter posts per day³. To help people who read Twitter posts, or tweets, Twitter provides a short list of popular topics called
