Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection

Summarizing multiple Thai documents is challenging because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and presents a three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted on fifty sets of Thai news articles together with their manually constructed reference summaries. Under three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
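
To make the graph-formation and selection stages more concrete, the following is a minimal sketch and not the method evaluated in this paper. It assumes that units are already segmented into token lists (TEDU segmentation itself is not shown), uses a generic PageRank-style update as a stand-in for the iterative unit weighting, and applies a simple similarity threshold for redundancy removal. Post-selection weight recalculation is omitted. All function names, the damping factor, and the threshold are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(units):
    """Build a symmetric similarity matrix; units are lists of tokens."""
    vecs = [Counter(u) for u in units]
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim


def iterate_weights(sim, damping=0.85, iters=50):
    """Assumed PageRank-style iteration over the weighted unit graph."""
    n = len(sim)
    out = [sum(row) for row in sim]  # total edge weight leaving each unit
    w = [1.0 / n] * n
    for _ in range(iters):
        w = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / out[j] * w[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return w


def select_units(sim, weights, k=2, max_sim=0.5):
    """Pick units by highest weight, skipping units too similar to chosen ones."""
    chosen = []
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    for i in ranked:
        if all(sim[i][j] < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)  # restore original unit order


# Toy English tokens stand in for segmented Thai units (illustrative only).
units = [
    ["flood", "hits", "bangkok"],
    ["flood", "hits", "bangkok", "today"],
    ["government", "sends", "aid"],
    ["aid", "reaches", "flood", "victims"],
]
sim = build_graph(units)
weights = iterate_weights(sim)
summary_ids = select_units(sim, weights, k=2, max_sim=0.5)
```

In this toy input the first two units are near-duplicates, so the threshold prevents both from entering the same summary; in practice the threshold and the number of selected units would need to be tuned against reference summaries.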
