Hierarchical summarization of large documents

Many automatic text summarization models have been developed in recent decades. Research in information science has shown that human abstractors extract sentences for summaries based on the hierarchical structure of documents. Existing automatic summarization models, however, do not take this behavior into account: they treat the document as a flat sequence of sentences when extracting sentences for a summary. In general, a document exhibits a well-defined hierarchical structure that can be described as a fractal, a mathematical object with a high degree of redundancy. In this article, we introduce the fractal summarization model, which is based on fractal theory. Important information is captured from the source document by exploring its hierarchical structure and salient features. A condensed version of the document that is informatively close to the source is produced iteratively using the contractive transformation of fractal theory. The fractal summarization model is the first attempt to apply fractal theory to document summarization. It significantly improves both the divergence of information coverage and the precision of the summary. User evaluations indicate that fractal summarization is promising and outperforms current summarization techniques that do not consider the hierarchical structure of documents. © 2008 Wiley Periodicals, Inc.
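
The idea of extracting sentences along the document hierarchy, rather than from a flat sequence, can be illustrated with a minimal sketch. It is not the model defined in the article: the `Node` tree, the keyword-count `score`, and the proportional `allocate` rule are all hypothetical stand-ins chosen for brevity. The sketch assumes that a sentence quota is split among the children of each node in proportion to their salience and that the split is repeated at every level until the leaves are reached.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical document-tree node: a chapter, section, or paragraph."""
    title: str
    sentences: List[str] = field(default_factory=list)   # filled on leaves only
    children: List["Node"] = field(default_factory=list)


def score(sentence: str, keywords: Set[str]) -> float:
    """Assumed salience feature: number of keyword occurrences in the sentence."""
    words = (w.strip(".,;:") for w in sentence.lower().split())
    return float(sum(1 for w in words if w in keywords))


def weight(node: Node, keywords: Set[str]) -> float:
    """Salience of a subtree: the sum of the scores of all sentences below it."""
    if node.children:
        return sum(weight(child, keywords) for child in node.children)
    return sum(score(s, keywords) for s in node.sentences)


def allocate(quota: int, weights: List[float]) -> List[int]:
    """Split an integer quota in proportion to weights (largest-remainder rounding)."""
    total = sum(weights)
    if total == 0:                       # no salient child: share the quota equally
        weights = [1.0] * len(weights)
        total = float(len(weights))
    exact = [quota * w / total for w in weights]
    alloc = [int(x) for x in exact]
    leftover = quota - sum(alloc)
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


def summarize(node: Node, quota: int, keywords: Set[str]) -> List[str]:
    """Pass the quota down the tree; pick the top-scoring sentences at each leaf."""
    if quota <= 0:
        return []
    if not node.children:
        ranked = sorted(range(len(node.sentences)),
                        key=lambda i: score(node.sentences[i], keywords),
                        reverse=True)
        keep = sorted(ranked[:quota])    # restore the original sentence order
        return [node.sentences[i] for i in keep]
    shares = allocate(quota, [weight(c, keywords) for c in node.children])
    summary: List[str] = []
    for child, share in zip(node.children, shares):
        summary.extend(summarize(child, share, keywords))
    return summary


# Toy document with two sections (illustrative data only).
doc = Node("Document", children=[
    Node("Introduction", sentences=[
        "Fractal theory describes self-similar structure.",
        "The weather was nice.",
    ]),
    Node("Method", sentences=[
        "The summary keeps fractal structure.",
        "We also had lunch.",
        "Fractal summary quota is shared by weight.",
    ]),
])
keywords = {"fractal", "summary", "structure"}
summary = summarize(doc, quota=3, keywords=keywords)
```

In this sketch the more salient section receives the larger share of the quota, so each branch of the hierarchy is represented roughly in proportion to its importance. A flat extractor would instead rank all sentences together, regardless of where they sit in the document.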
