Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction

In this paper, we apply the concept of k-core to the graph-of-words representation of text for single-document keyword extraction, retaining only the nodes of the main core as representative terms. By selecting more cohesive subsets of nodes than existing graph-based approaches that rely solely on centrality, this approach better accounts for the proximity between keywords and for variability in the number of extracted keywords. Experiments on two standard datasets show statistically significant improvements in F1-score and in the area under the precision/recall curve (AUC) over baseline results, in particular when the edges of the graph are weighted by the number of co-occurrences. To the best of our knowledge, this is the first application of graph degeneracy to natural language processing and information retrieval.
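
The main-core retention described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sliding-window size, and the assumption that the input is an already preprocessed token list (for example, stop words removed) are all illustrative choices. The sketch builds an undirected graph-of-words whose edge weights count co-occurrences within a window, computes core numbers by repeatedly peeling the node of minimum (weighted) degree, and keeps the nodes with the highest core number.

```python
from collections import defaultdict


def build_graph_of_words(tokens, window=4):
    """Undirected graph-of-words (hypothetical helper).

    Each term is linked to the next `window - 1` terms; the edge weight
    is the number of times the two terms co-occur within that window.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for i, term in enumerate(tokens):
        weights[term]  # register the term even if it ends up isolated
        for other in tokens[i + 1 : i + window]:
            if other != term:  # ignore self-loops
                weights[term][other] += 1
                weights[other][term] += 1
    return {term: dict(nbrs) for term, nbrs in weights.items()}


def core_numbers(graph, weighted=True):
    """Core number of every node, by peeling the minimum-degree node.

    With weighted=True the degree is the sum of incident edge weights;
    otherwise it is the number of neighbours. This simple version is
    quadratic in the number of nodes, which is fine for one document.
    """
    degree = {
        v: (sum(nbrs.values()) if weighted else len(nbrs))
        for v, nbrs in graph.items()
    }
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        # Ties are broken alphabetically to keep the run deterministic.
        v = min(remaining, key=lambda u: (degree[u], u))
        k = max(k, degree[v])  # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for u, w in graph[v].items():
            if u in remaining:
                degree[u] -= w if weighted else 1
    return core


def main_core(graph, weighted=True):
    """Set of nodes whose core number is maximal (the main core)."""
    core = core_numbers(graph, weighted)
    if not core:
        return set()
    k_max = max(core.values())
    return {v for v, k in core.items() if k == k_max}


if __name__ == "__main__":
    # Toy input, assumed to be already tokenized and filtered.
    tokens = ["a", "b", "c", "a", "b", "c", "d"]
    graph = build_graph_of_words(tokens, window=2)
    print(sorted(main_core(graph, weighted=True)))
```

Because the size of the main core depends on the structure of the graph rather than on a fixed cut-off, the number of returned terms varies from document to document, which is the adaptability the abstract refers to.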
