Paper information - Multi-document Arabic text summarisation

Multi-document Arabic text summarisation

In this paper we present our generic extractive Arabic and English multi-document summarisers. We also describe the use of machine translation for evaluating the generated Arabic multi-document summaries using English extractive gold standards. In this work we first address the lack of Arabic multi-document corpora for summarisation and the absence of automatic and manual Arabic gold-standard summaries. These are required to evaluate any automatic Arabic summarisers. Second, we demonstrate the use of Google Translate in creating an Arabic version of the DUC-2002 dataset. The parallel Arabic/English dataset is summarised using the Arabic and English summarisation systems. The automatically generated summaries are evaluated using the ROUGE metric, as well as precision and recall. The results we achieve are compared with the top five systems in the DUC-2002 multi-document summarisation task.
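The evaluation measures named in the abstract (ROUGE, precision and recall) can be illustrated with a minimal sketch. This is not the official ROUGE package and not the authors' evaluation code: the function names `ngrams` and `rouge_n`, the lower-casing step, and the whitespace tokenisation are assumptions made for illustration only, and real Arabic text would likely need proper tokenisation and stemming.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N style n-gram overlap between two summaries.

    Assumption: tokens are obtained by lower-casing and splitting on
    whitespace; no stemming or stop-word removal is applied.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    system_summary = "the cat sat on the mat"
    gold_summary = "the cat lay on the mat"
    print(rouge_n(system_summary, gold_summary, n=1))
    print(rouge_n(system_summary, gold_summary, n=2))
```

Recall measures how much of the gold-standard summary is covered by the system summary, while precision measures how much of the system summary also appears in the gold standard.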
