Similarity versus relatedness: A novel approach in extractive Persian document summarisation

Automatic text summarisation is the process of creating a summary from one or more documents by eliminating details while preserving the worthwhile information. This article presents a single- and multi-document summariser that uses a novel clustering method to create summaries. First, a feature selection phase is employed. Then FarsNet, the Persian WordNet, is used to extract the semantic information of words. Next, the input sentences are grouped into three main kinds of cluster: similarity, relatedness and coherency. Each similarity cluster contains sentences that are similar to its core, while each relatedness cluster contains sentences that are related (but not similar) to its core. The coherency clusters identify the sentences that should be kept together to preserve the coherency of the summary. Finally, the centroid of each similarity cluster with the highest feature score is added to an empty summary. The summary is then enlarged iteratively by including related sentences from the relatedness clusters and excluding sentences that are similar to its current content. Coherency clusters are applied to the resulting summary in the last step. The proposed method has been compared with three existing text summarisation systems and techniques for the Persian language: FarsiSum, Parsumist and Ijaz. In our experiments, the proposed method improves the results on several measures, including precision, recall, F-measure, ROUGE-N and ROUGE-L.
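The selection step described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_summary`, the dictionary-based cluster representation, the `max_sentences` stopping rule and the reading of "centroid with the highest feature score" as a single seed sentence are all assumptions made for illustration. Feature scoring, the FarsNet-based clustering and the final coherency step are left out.

```python
def build_summary(scores, similar, related, max_sentences):
    """Greedy selection sketch (hypothetical, not the paper's code).

    scores:  sentence index -> feature score
    similar: core sentence index -> indices of sentences similar to it
    related: core sentence index -> indices related (but not similar) to it
    """
    # Seed the empty summary with the highest-scoring similarity-cluster core.
    seed = max(sorted(similar), key=lambda i: scores[i])
    summary = [seed]
    # Sentences similar to anything already chosen are excluded as redundant.
    excluded = set(similar[seed])

    while len(summary) < max_sentences:
        # Candidates are sentences related to the current summary content.
        candidates = set()
        for s in summary:
            candidates |= related.get(s, set())
        candidates -= set(summary)
        candidates -= excluded
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda i: scores[i])
        summary.append(best)
        excluded |= similar.get(best, set())

    # Return the chosen sentences in their original document order.
    return sorted(summary)


# Toy input: five sentences with invented scores and clusters.
scores = {0: 0.9, 1: 0.8, 2: 0.5, 3: 0.7, 4: 0.2}
similar = {0: {1}, 3: {4}}
related = {0: {2, 3}, 3: {2}}
print(build_summary(scores, similar, related, max_sentences=3))
```

In this toy input, sentence 1 is never selected despite its high score, because it is similar to the seed sentence 0; that is the redundancy-avoiding behaviour the similarity/relatedness distinction is meant to provide.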
