Chinese Text Summarization Using a Trainable Summarizer and Latent Semantic Analysis

This paper proposes two novel approaches for extracting important sentences from a document to create its summary. The first is a corpus-based approach using feature analysis. It introduces three ideas: 1) using ranked position to emphasize the significance of sentence position, 2) reshaping word units to estimate keyword importance more accurately, and 3) training a score function with a genetic algorithm to obtain a suitable combination of feature weights. The second approach combines latent semantic analysis with text relationship maps to interpret the conceptual structure of a document. Both approaches are applied to Chinese text summarization. They were evaluated on a corpus of 100 political articles from New Taiwan Weekly; at a compression ratio of 30%, they achieved average recalls of 52.0% and 45.6%, respectively.
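
To make the first approach and its evaluation concrete, the following is a minimal sketch of feature-weighted sentence extraction, not the authors' implementation. It assumes two illustrative features (a rank-based position score and an average keyword weight), fixed feature weights standing in for the ones a genetic algorithm would learn, and recall defined as the fraction of reference-summary sentences that the extract recovers. All function names, feature formulas, and parameters are hypothetical.

```python
def position_score(index, total):
    # Hypothetical rank-based position feature: earlier sentences score higher.
    return (total - index) / total


def keyword_score(tokens, keyword_weights):
    # Average keyword weight over the sentence; unknown tokens count as 0.
    if not tokens:
        return 0.0
    return sum(keyword_weights.get(t, 0.0) for t in tokens) / len(tokens)


def summarize(sentences, keyword_weights, w_pos, w_kw, ratio=0.3):
    """Return indices of the top-scoring sentences, in document order.

    sentences: list of token lists (word segmentation is assumed done upstream).
    w_pos, w_kw: feature weights; here fixed by hand, whereas the paper
    describes training such weights with a genetic algorithm.
    ratio: compression ratio, the fraction of sentences to keep.
    """
    total = len(sentences)
    keep = max(1, round(total * ratio))
    scored = [
        (w_pos * position_score(i, total)
         + w_kw * keyword_score(tokens, keyword_weights), i)
        for i, tokens in enumerate(sentences)
    ]
    # Highest score first; ties are broken in favor of the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:keep]
    return sorted(i for _, i in top)


def recall(selected, reference):
    # Assumed definition: share of reference sentences present in the extract.
    if not reference:
        return 0.0
    return len(set(selected) & set(reference)) / len(reference)
```

A genetic algorithm would search over `w_pos` and `w_kw` (and any further feature weights), using average recall on a training corpus as the fitness function; that search is omitted here for brevity.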
