Using the Annotated Bibliography as a Resource for Indicative Summarization

We report on a language resource consisting of 2000 annotated bibliography entries, which is being analyzed as part of our research on indicative document summarization. We show how annotated bibliographies cover certain aspects of summarization that have not been well covered by other summary corpora, and motivate why they constitute an important form to study for information retrieval. We detail our methodology for collecting the corpus and give an overview of the document feature markup that we introduced to facilitate summary analysis. We present the characteristics of the corpus and the methods used to collect it, and show its use in finding the distribution of the types of information included in indicative summaries and their relative ordering within the summaries.

Automatic text summarization has largely been synonymous with domain-independent sentence extraction techniques (for an overview, see Paice (1990)). These approaches have used a battery of indicators such as cue phrases, term frequency, and sentence position to choose sentences to extract and form into a summary. An alternative approach is to collect sample summaries and apply machine learning techniques to identify what types of information are included in a summary, to identify their stylistic, grammatical, and lexical choice characteristics, and to generate or regenerate a summary based on these characteristics. In this paper, we examine the first step towards this goal: the collection of an appropriate summary corpus. We focus on annotated bibliography entries because they are written without reliance on sentence extraction. Furthermore, these entries contain both informative (i.e., details and topics of the resource) and indicative (e.g., metadata such as author or purpose) information.
We believe that summary texts similar in form to annotated bibliography entries, such as the one shown in Figure 1, can better serve users and replace the standard top-sentence or query-word-in-context summaries commonly found in current-generation search engines. Our corpus of summaries consists of 2000 annotated bibliography entries collected from various Internet websites using search engines. We first review aspects and dimensions of text summaries, and detail reasons for collecting a corpus of annotated bibliography entries. We follow with details on the collection methodology and a description of our annotation of the entries. We conclude with some current applications of the corpus to automatic text summarization research.