Revision-based generation of natural language summaries providing historical background: corpus-based analysis, design, implementation and evaluation

Automatically summarizing vast amounts of on-line quantitative data with a short natural language paragraph has a wide range of real-world applications. However, this specific task raises a number of difficult issues that are quite distinct from the generic task of language generation: conciseness, complex sentences, floating concepts, historical background, paraphrasing power and implicit content. In this thesis, I address these specific issues by proposing a new generation model in which a first pass builds a draft containing only the essential new facts to report and a second pass incrementally revises this draft to opportunistically add as many background facts as can fit within the space limit. This model requires a new type of linguistic knowledge: revision operations, which specifies the various ways a draft can be transformed in order to concisely accommodate a new piece of information. I present an in-depth corpus analysis of human-written sports summaries that resulted in an extensive set of such revision operations. I also present the implementation, based on functional unification grammars, of the system scSTREAK, which relies on these operations to incrementally generate complex sentences summarizing basketball games. This thesis also contains two quantitative evaluations. The first shows that the new revision-based generation model is far more robust than the one-pass model of previous generators. The second evaluation demonstrates that the revision operations acquired during the corpus analysis and implemented in scSTREAK are, for the most part, portable to at least one other quantitative domain (the stock market). scSTREAK is the first report generator that systematically places the facts which it summarizes in their historical perspective. It is more concise than previous systems thanks to its ability to generate more complex sentences and to opportunistically convey facts by adding a few words to carefully chosen draft constituents. The revision operations on which scSTREAK is based constitute the first set of corpus-based linguistic knowledge geared towards incremental generation. The evaluation presented in this thesis is also the first attempt to quantitatively assess the robustness of a new generation model and the portability of a new type of linguistic knowledge.