Single-Document Summarization as a Tree Knapsack Problem

Recent studies on extractive text summarization formulate it as a combinatorial optimization problem such as a Knapsack Problem, a Maximum Coverage Problem, or a Budgeted Median Problem. These methods improved summarization quality, but they do not consider the rhetorical relations between the textual units of a source document, so the summaries they generate may lack logical coherence. This paper proposes a single-document summarization method based on trimming a discourse tree. The method has two parts. First, we propose rules for transforming a Rhetorical Structure Theory-based discourse tree into a dependency-based discourse tree, which allows us to take a tree-trimming approach to summarization. Second, we formulate the problem of trimming a dependency-based discourse tree as a Tree Knapsack Problem and solve it with integer linear programming (ILP). Evaluation results showed that our method improved ROUGE scores.
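
To make the Tree Knapsack formulation concrete, the following is a minimal sketch, not the paper's implementation. The paper solves the problem with ILP; this sketch swaps in a small dynamic program over the tree so that it runs without a solver. It assumes each textual unit has an integer length and a score, that a unit may be selected only if its parent in the dependency-based discourse tree is also selected, and that the total length must stay within a budget. The function name `trim_tree`, its parameters, and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def trim_tree(parent: List[int], length: List[int], score: List[int],
              budget: int) -> Tuple[int, List[int]]:
    """Pick a rooted subtree maximizing total score within a length budget.

    parent[v] is the index of v's parent, or -1 for the root (assumed unique).
    A node may be selected only if its parent is selected.
    """
    n = len(parent)
    children: List[List[int]] = [[] for _ in range(n)]
    root = -1
    for v, p in enumerate(parent):
        if p < 0:
            root = v
        else:
            children[p].append(v)

    def solve(v: int) -> Dict[int, Tuple[int, List[int]]]:
        # Maps "total length used" to the best (score, nodes) among
        # subtrees rooted at v that include v itself.
        if length[v] > budget:
            return {}
        table = {length[v]: (score[v], [v])}
        for c in children[v]:
            child = solve(c)
            merged = dict(table)  # option: leave child c out entirely
            for used, (s, nodes) in table.items():
                for c_used, (c_s, c_nodes) in child.items():
                    total = used + c_used
                    if total > budget:
                        continue
                    if total not in merged or s + c_s > merged[total][0]:
                        merged[total] = (s + c_s, nodes + c_nodes)
            table = merged
        return table

    table = solve(root)
    if not table:  # even the root alone exceeds the budget
        return 0, []
    best_score, best_nodes = max(table.values(), key=lambda t: t[0])
    return best_score, sorted(best_nodes)


# Toy tree: unit 0 is the root; units 1 and 2 depend on 0; units 3 and 4 depend on 1.
toy_parent = [-1, 0, 0, 1, 1]
toy_length = [5, 4, 6, 3, 2]
toy_score = [3, 2, 4, 5, 1]
result = trim_tree(toy_parent, toy_length, toy_score, budget=12)
```

In this toy instance, unit 3 has the highest score but can only be kept together with its ancestors 1 and 0, which is the constraint that distinguishes tree trimming from a plain knapsack over independent units.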
