Learning Similarity Functions in Graph-Based Document Summarization

Graph-based models have been extensively explored in document summarization in recent years. Compared with traditional feature-based models, graph-based models incorporate interrelated information into the ranking process. Thus, potentially they can do a better job in retrieving the important contents from documents. In this paper, we investigate the problem of how to measure sentence similarity which is a crucial issue in graph-based summarization models but in our belief has not been well defined in the past. We propose a supervised learning approach that brings together multiple similarity measures and makes use of human-generated summaries to guide the combination process. Therefore, it can be expected to provide more accurate estimation than a single cosine similarity measure. Experiments conducted on the DUC2005 and DUC2006 data sets show that the proposed learning approach is successful in measuring similarity. Its competitiveness and adaptability are also demonstrated.

[1]  Rada Mihalcea,et al.  TextRank: Bringing Order into Text , 2004, EMNLP.

[2]  Francine Chen,et al.  A trainable document summarizer , 1995, SIGIR '95.

[3]  Ilyas Cicekli,et al.  Using lexical chains for keyword extraction , 2007, Inf. Process. Manag..

[4]  Hans Peter Luhn,et al.  The Automatic Creation of Literature Abstracts , 1958, IBM J. Res. Dev..

[5]  Dragomir R. Radev,et al.  LexPageRank: Prestige in Multi-Document Text Summarization , 2004, EMNLP.

[6]  Dragomir R. Radev,et al.  Introduction to the Special Issue on Summarization , 2002, CL.

[7]  Dragomir R. Radev,et al.  Using Random Walks for Question-focused Sentence Retrieval , 2005, HLT.

[8]  Rada Mihalcea,et al.  A Language Independent Algorithm for Single and Multiple Document Summarization , 2005, IJCNLP.

[9]  Bernhard Schölkopf,et al.  New Support Vector Algorithms , 2000, Neural Computation.

[10]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[11]  Wenjie Li,et al.  Developing learning strategies for topic-based summarization , 2007, CIKM '07.

[12]  Hongyuan Zha,et al.  Generic summarization and keyphrase extraction using mutual reinforcement principle and sentence clustering , 2002, SIGIR '02.

[13]  Xiaojun Wan,et al.  Using Cross-Document Random Walks for Topic-Focused Multi-Document , 2006, 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06).

[14]  Jade Goldstein-Stewart,et al.  The use of MMR, diversity-based reranking for reordering documents and producing summaries , 1998, SIGIR '98.

[15]  Regina Barzilay,et al.  Information Fusion in the Context of Multi-Document Summarization , 1999, ACL.

[16]  Chris Buckley,et al.  Automatic Text Summarization by Paragraph Extraction , 1997 .

[17]  C. J. van Rijsbergen,et al.  Query-Sensitive Similarity Measures for Information Retrieval , 2003, Knowledge and Information Systems.

[18]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[19]  Eduard H. Hovy,et al.  Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics , 2003, NAACL.

[20]  Hoa Trang Dang,et al.  Overview of DUC 2005 , 2005 .