We present SumBasic+, a multi-document summarization system built from first principles and designed as a baseline for gauging the quality of summaries obtainable with simple statistical techniques. The system is extractive and relies on word frequency statistics, similar to the SumBasic method. We were nonetheless able to improve considerably on SumBasic's summarization performance by tuning the amount and type of redundancy removal performed, adding a simple query-focused summarization component, and employing a number of pre- and post-processing compression techniques. Because it relies principally on existing techniques yet performs surprisingly well, SumBasic+ is a strong baseline against which to compare new summarization approaches. Of 43 competing systems in the TAC 2010 summarization track, our system placed fourth in ROUGE-2 (R-2) and third in ROUGE-SU4 (R-SU4), and second overall in the manual average pyramid evaluation for the initial summaries.
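To make the word-frequency idea concrete, the following is a minimal sketch of a SumBasic-style extractive loop, not the SumBasic+ system itself. It assumes pre-split sentences, a naive regex tokenizer, and a word budget; the names `tokenize`, `sumbasic`, and `max_words` are hypothetical, and the query-focused and compression components mentioned above are omitted. Each sentence is scored by the average probability of its words, and the probabilities of words in a selected sentence are squared so that later picks favor content not yet covered.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Naive lowercase word tokenizer (an assumption, not the paper's)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def sumbasic(sentences, max_words=100):
    """Greedy SumBasic-style selection over a list of sentence strings."""
    tokenized = [tokenize(s) for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    if total == 0:
        return []
    # Unigram probability of each word across the whole input.
    prob = {w: c / total for w, c in counts.items()}

    summary, used, length = [], set(), 0
    while True:
        best, best_score = None, 0.0
        for i, toks in enumerate(tokenized):
            # Skip sentences already chosen, empty, or over the word budget.
            if i in used or not toks or length + len(toks) > max_words:
                continue
            score = sum(prob[w] for w in toks) / len(toks)
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break
        used.add(best)
        summary.append(sentences[best])
        length += len(tokenized[best])
        # Redundancy update: square the probability of every covered word.
        for w in set(tokenized[best]):
            prob[w] **= 2
    return summary
```

In this simplified form the squaring step is the only redundancy control; the abstract indicates that SumBasic+ tunes the amount and type of redundancy removal beyond this, in ways the sketch does not attempt to reproduce.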