Search engine reinforced semi-supervised classification and graph-based summarization of microblogs

Microblog services are popular and therefore hold an abundance of information. However, the potential of this trove is limited by the lack of effective means for users to browse and interpret the numerous messages found on these services. We tackle this problem with a two-step process. First, we slice up the search results of current retrieval systems along multiple possible genres. Then, we generate a summary from the microblog messages attributed to each genre. We believe that this helps users to better understand the possible interpretations of the retrieved results and aids them in finding the information that they need. Our novel approach makes use of automatically acquired information from external search engines in each of these two steps. We first integrate this information with a semi-supervised probabilistic graphical model, and show that this helps us achieve significantly better classification performance without the need for much training data. Next, we incorporate the extra information into graph-based summarization, and demonstrate that superior summaries (up to 30% improvement in ROUGE-1) are obtained.
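
To make the second step more concrete, the following is a minimal sketch of generic graph-based extractive summarization in the LexRank/TextRank style: messages are nodes, edges are weighted by cosine similarity, and a PageRank-like iteration scores each message. It is not the method of this paper. The function names, the whitespace tokenizer, the damping value, and the `extra` argument (a hypothetical stand-in for terms acquired from an external search engine) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag(text, extra_terms=()):
    """Bag of words for one message, optionally expanded with extra terms.

    `extra_terms` is a hypothetical stand-in for externally acquired terms.
    """
    return Counter(text.lower().split()) + Counter(extra_terms)


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_messages(messages, extra=None, damping=0.85, iters=50):
    """Score messages by centrality in their similarity graph."""
    n = len(messages)
    if n == 0:
        return []
    extra = extra or {}
    bags = [bag(m, extra.get(i, ())) for i, m in enumerate(messages)]
    # Weighted adjacency matrix without self-loops.
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for j in range(n):
            # Each neighbour i passes on a share of its score,
            # proportional to the weight of the edge i -> j.
            inflow = sum(scores[i] * sim[i][j] / row_sums[i]
                         for i in range(n) if row_sums[i] > 0)
            updated.append((1 - damping) / n + damping * inflow)
        scores = updated
    return scores


def summarize(messages, k=1, extra=None):
    """Return the k highest-scoring messages in their original order."""
    scores = rank_messages(messages, extra)
    top = sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)[:k]
    return [messages[i] for i in sorted(top)]
```

In this sketch a message that shares no terms with the others receives only the baseline score, so it is unlikely to be selected. Passing `extra={index: [terms]}` shows, in a simplified way, how added terms can connect an otherwise isolated short message to the rest of the graph.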
