OCELOT: a system for summarizing Web pages

We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages differ from news text in both structure and content. Instead of coherent text with a well-defined discourse structure, a web page is more likely to be a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to summarize a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than extracting a text span verbatim. OCELOT uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.
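
As a rough illustration of what "selecting and ordering words" with probabilistic models can mean, the following minimal sketch scores candidate gists with two toy models: a per-word selection probability and a bigram ordering probability. All names and numbers here (`select_prob`, `bigram_prob`, `FLOOR`, `best_gist`) are hypothetical and invented for this example; they are not OCELOT's learned parameters, and the brute-force search stands in for whatever search procedure the actual system uses.

```python
import itertools
import math

# Hypothetical toy parameters, invented for illustration only.
# select_prob[w]: how likely word w is to belong in a gist of this page.
select_prob = {"music": 0.6, "reviews": 0.5, "jazz": 0.4, "click": 0.05}

# bigram_prob[(prev, w)]: how likely w is to follow prev in a gist.
bigram_prob = {
    ("<s>", "jazz"): 0.5,
    ("jazz", "music"): 0.6,
    ("music", "reviews"): 0.5,
}
FLOOR = 0.01  # assumed fallback probability for unseen bigrams


def score(gist):
    """Log-probability of a candidate gist: selection term plus ordering term."""
    total = 0.0
    prev = "<s>"
    for word in gist:
        total += math.log(select_prob[word])
        total += math.log(bigram_prob.get((prev, word), FLOOR))
        prev = word
    return total


def best_gist(candidates, length):
    """Exhaustively score every ordered sequence of `length` distinct words."""
    return max(itertools.permutations(candidates, length), key=score)


gist = best_gist(list(select_prob), 3)
print(" ".join(gist))
```

Exhaustive search is only workable at this toy scale; the point of the sketch is that word choice and word order are scored jointly, so a word that is likely to appear in a gist can still lose out if it fits poorly next to its neighbors.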
