A Novel Chinese Text Summarization Approach Using Sentence Extraction Based on Kernel Words Recognition

The continuing growth of the World Wide Web and of online text collections makes a large volume of information available to users, and automatic text summarization helps them grasp the content of documents quickly. This paper proposes an automated technique for Chinese document summarization based on kernel word recognition and discourse segment extraction. The method consists of five steps. First, the input articles are annotated by lexical analysis. Second, all focused named entities are recognized using a machine learning method. Third, the input articles are divided into discourse segments, the kernel words of each segment are extracted by rule-based main verb recognition, and the relations among entities are extracted. Fourth, candidate sentences are ranked by rules, and redundant sentences are removed using kernel word information. Finally, the most important sentences are selected to compose the summary according to the expected compression ratio, and these sentences are output using a special document as reference. A series of experiments is performed on two Chinese document collections. The results show that the proposed technique outperforms the reference systems.
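
The fourth and fifth steps (ranking, redundancy removal, and selection under a compression ratio) can be illustrated with a minimal sketch. It is not the paper's implementation: the `Candidate` structure, the Jaccard overlap test on kernel words, the `max_overlap` threshold, and the sentence-count budget are all assumptions made for illustration, and the scores are taken as given rather than computed by the paper's ranking rules.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one candidate sentence."""
    index: int               # position of the sentence in the source document
    text: str
    score: float             # importance score, assumed to come from the ranking step
    kernel_words: frozenset  # kernel words of the sentence (assumed precomputed)


def jaccard(a, b):
    """Jaccard overlap of two word sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_sentences(candidates, compression_ratio, max_overlap=0.5):
    """Greedily pick top-scoring sentences, skipping redundant ones.

    A sentence is treated as redundant when its kernel words overlap an
    already chosen sentence by more than max_overlap (an assumed criterion).
    The budget is a sentence count derived from the compression ratio.
    """
    budget = max(1, math.ceil(len(candidates) * compression_ratio))
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if len(chosen) >= budget:
            break
        if any(jaccard(cand.kernel_words, c.kernel_words) > max_overlap
               for c in chosen):
            continue  # redundant with a sentence already in the summary
        chosen.append(cand)
    # Restore document order so the summary reads naturally.
    return sorted(chosen, key=lambda c: c.index)
```

Sorting the chosen sentences back into document order is a design choice of this sketch; the abstract does not state how the output sentences are ordered.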
