Graph-Based Keyword Extraction for Single-Document Summarization

In this paper, we introduce and compare two novel approaches, one supervised and one unsupervised, for identifying the keywords used in extractive summarization of text documents. Both approaches build on a graph-based syntactic representation of text and web documents, which enhances the traditional vector-space model by taking into account some structural document features. In the supervised approach, we train classification algorithms on a collection of summarized documents in order to induce a keyword identification model. In the unsupervised approach, we run the HITS algorithm on document graphs under the assumption that the top-ranked nodes represent the document keywords. Our experiments on a collection of benchmark summaries show that, given a set of summarized training documents, supervised classification provides the highest keyword identification accuracy, while the highest F-measure is reached with a simple degree-based ranking. In addition, it is sufficient to perform only the first iteration of HITS rather than running it to convergence.
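The unsupervised idea described above can be illustrated with a minimal sketch. It assumes a deliberately simplified document graph in which each unique term is a node and a directed edge links a term to the term that immediately follows it; the graph representation used in the paper may differ. The function names (`build_graph`, `degree_rank`, `hits`), the sum-based normalization, and the simultaneous update of hub and authority scores are illustrative choices, not details taken from the paper.

```python
from collections import defaultdict


def build_graph(tokens):
    """Build a directed term graph: edge u -> v when v immediately follows u.

    Assumption: tokens are already preprocessed (stop words removed, stemmed).
    """
    succ = defaultdict(set)  # outgoing neighbours of each term
    pred = defaultdict(set)  # incoming neighbours of each term
    for u, v in zip(tokens, tokens[1:]):
        if u != v:  # ignore self-loops from immediately repeated terms
            succ[u].add(v)
            pred[v].add(u)
    return set(tokens), succ, pred


def degree_rank(nodes, succ, pred):
    """Rank terms by total degree (in-degree + out-degree), ties broken by name."""
    degree = {n: len(succ.get(n, ())) + len(pred.get(n, ())) for n in nodes}
    return sorted(nodes, key=lambda n: (-degree[n], n))


def _normalize(scores):
    """Scale scores so they sum to 1 (left at 0.0 if every score is zero)."""
    total = sum(scores.values())
    return {n: (s / total if total else 0.0) for n, s in scores.items()}


def hits(nodes, succ, pred, iterations=1):
    """Run a fixed number of HITS iterations, starting from all-ones scores.

    Both score vectors are updated simultaneously from the previous iteration.
    """
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_auth = {n: sum(hub[p] for p in pred.get(n, ())) for n in nodes}
        new_hub = {n: sum(auth[s] for s in succ.get(n, ())) for n in nodes}
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub
```

In this sketch, a single iteration from all-ones scores makes each authority score proportional to the node's in-degree and each hub score proportional to its out-degree. This is a property of the sketch rather than a result stated in the paper, but it gives one intuition for why a degree-based ranking and a single HITS iteration can produce similar keyword candidates.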
