HG-Rank: A Hypergraph-based Keyphrase Extraction for Short Documents in Dynamic Genre

Conventional keyphrase extraction algorithms are applied to a fixed corpus of lengthy documents, where keyphrases distinguish documents from each other. However, with the emergence of social networks and microblogs, the nature of such documents has changed. Documents are now short and their topics evolve over time, which requires specific algorithms to capture all of their features. In this paper, we propose a hypergraph-based ranking algorithm that models all of these features in a random walk approach. Our random walk approach uses the weights of both hyperedges and vertices to model short documents’ temporal and social features and discriminative weights for word features, respectively, while measuring the centrality of words in the hypergraph. We empirically test the effectiveness of our approach on two different data sets of short documents and show that it improves precision by 14% to 25% over the closest baseline on a Twitter data set and by 10% to 27% on the Opinosis data set.
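The abstract does not give the exact transition rule, so the following is only a minimal sketch of a random walk over a hypergraph with both hyperedge and vertex weights, under stated assumptions. Each short document is taken to be one hyperedge over its words. The walker is assumed to pick a hyperedge in proportion to its weight (standing in for temporal and social features), then a word inside it in proportion to the word's weight (standing in for discriminative word features). The function name `hypergraph_rank`, the damping factor, and all example weights are hypothetical and not taken from the paper.

```python
def hypergraph_rank(hyperedges, edge_weights, vertex_weights,
                    damping=0.85, max_iters=200, tol=1e-10):
    """Rank words by a damped random walk on a weighted hypergraph.

    hyperedges     : list of sets of words, one set per short document
    edge_weights   : list of positive floats, one per hyperedge (assumed
                     to encode temporal/social importance of a document)
    vertex_weights : dict word -> positive float (assumed to encode how
                     discriminative a word is)
    """
    vertices = sorted(set().union(*hyperedges))
    n = len(vertices)

    # Weighted degree d(u): total weight of the hyperedges containing u.
    degree = {v: 0.0 for v in vertices}
    for edge, w in zip(hyperedges, edge_weights):
        for v in edge:
            degree[v] += w

    # Total vertex weight inside each hyperedge, used to normalise step 2.
    edge_mass = [sum(vertex_weights[v] for v in edge) for edge in hyperedges]

    scores = {v: 1.0 / n for v in vertices}
    for _ in range(max_iters):
        # Teleportation term, as in PageRank-style walks.
        new_scores = {v: (1.0 - damping) / n for v in vertices}
        for edge, w, mass in zip(hyperedges, edge_weights, edge_mass):
            # Step 1: mass entering this hyperedge from its member words,
            # each choosing the hyperedge with probability w / d(u).
            inflow = sum(scores[u] * w / degree[u] for u in edge)
            # Step 2: redistribute that mass to member words in
            # proportion to their vertex weights.
            for v in edge:
                new_scores[v] += damping * inflow * vertex_weights[v] / mass
        change = sum(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if change < tol:
            break
    return scores


if __name__ == "__main__":
    # Three toy "tweets"; the second is given a larger (hypothetical) weight.
    docs = [{"hypergraph", "ranking", "keyphrase"},
            {"keyphrase", "twitter"},
            {"keyphrase", "ranking"}]
    doc_weights = [1.0, 2.0, 1.0]
    word_weights = {"hypergraph": 0.5, "ranking": 0.8,
                    "keyphrase": 1.0, "twitter": 0.6}
    ranked = hypergraph_rank(docs, doc_weights, word_weights)
    for word, score in sorted(ranked.items(), key=lambda kv: -kv[1]):
        print(f"{word}: {score:.4f}")
```

Because each step is normalised (by `degree[u]` for the choice of hyperedge and by `mass` for the choice of word), the scores remain a probability distribution across iterations, so they can be read directly as a centrality ranking of words.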
