Multi-Document Summarization from First Principles

We present SumBasic+, a multi-document summarization system built from first principles. SumBasic+ is designed as a baseline for gauging the quality of summaries that can be obtained with simple statistical techniques. It is an extractive system based on word frequency statistics, similar to the SumBasic method. We considerably improve on the performance of that basic approach by tuning the amount and type of redundancy removal performed, adding a simple query-focused summarization component, and employing a number of pre- and post-processing compression techniques. The resulting system is a strong baseline that is well suited for comparison with new summarization approaches, as it relies principally on existing techniques and performs surprisingly well. Of 43 competing systems in the TAC 2010 summarization track, our system placed fourth and third in R-2 and R-SU4 ROUGE scores respectively, and second overall in the manual average pyramid evaluation for the initial summaries.
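For readers unfamiliar with the underlying method, the following is a minimal sketch of the original SumBasic selection loop, not of SumBasic+ itself. It omits the redundancy tuning, query focus, and compression steps described above. The tokenizer, the function names, and the `max_words` budget are illustrative assumptions rather than details taken from the system.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def sumbasic(sentences, max_words=100):
    """Greedy SumBasic-style extraction over a list of sentence strings.

    Returns the selected sentences in the order they were picked.
    """
    tokenized = [tokenize(s) for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    if total == 0:
        return []

    # Unigram probability of each word over the whole input.
    prob = {w: c / total for w, c in counts.items()}

    remaining = [i for i, toks in enumerate(tokenized) if toks]
    summary, length = [], 0

    while remaining:
        # Restrict candidates to sentences containing the currently most probable word.
        top_word = max(prob, key=prob.get)
        candidates = [i for i in remaining if top_word in tokenized[i]] or remaining

        # Score each candidate by the average probability of its words.
        best = max(
            candidates,
            key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]),
        )
        remaining.remove(best)

        # Skip sentences that would exceed the word budget.
        if length + len(tokenized[best]) > max_words:
            continue

        summary.append(sentences[best])
        length += len(tokenized[best])

        # Square the probability of every word just used, so that repeated
        # content becomes less attractive in later iterations.
        for w in set(tokenized[best]):
            prob[w] **= 2

    return summary
```

Squaring the probabilities of words already covered is the only redundancy control in this sketch, and it is the kind of component that the abstract describes tuning.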