Keyword Extraction from Arabic Documents using Term Equivalence Classes

The rapid growth of the Internet and other computing facilities in recent years has produced a large amount of text in electronic form, which has increased the importance of automatic text processing applications such as keyword extraction and term indexing. Although keywords are useful for many applications, most documents available online are not provided with them. We describe a method for extracting keywords from Arabic documents. The method identifies keywords by combining linguistic and statistical analysis of the text, without using prior knowledge of its domain or information from any related corpus. The text is first preprocessed to extract the main linguistic information, such as the roots and morphological patterns of derivative words. A cleaning phase then eliminates meaningless words from the text. The most frequent terms are clustered into equivalence classes: derivative words generated from the same root are placed together, as are non-derivative words generated from the same stem, and the counts of the members of each class are accumulated. A vector space model is then used to capture the most frequent N-grams in the text. Experiments on a real-world dataset show that the proposed method achieves an average precision of 31% and an average recall of 53% when tested against manually assigned keywords.
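The grouping of terms into equivalence classes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ANALYSIS` lookup table, the `STOPWORDS` set, and the function name `equivalence_classes` are hypothetical stand-ins for a real morphological analyzer and cleaning phase, and the words are written in a rough Buckwalter-style transliteration purely for readability.

```python
from collections import Counter, defaultdict

# Hypothetical morphological lookup (an assumption, not from the paper):
# surface form -> (kind, key). Derivative words map to their root,
# non-derivative words map to their stem.
ANALYSIS = {
    "ktAb": ("root", "ktb"),        # book
    "kAtb": ("root", "ktb"),        # writer
    "mktbp": ("root", "ktb"),       # library
    "mdrsp": ("root", "drs"),       # school
    "kmbywtr": ("stem", "kmbywtr"),     # computer (loanword, no root)
    "Alkmbywtr": ("stem", "kmbywtr"),   # the computer
}

# Hypothetical list of words removed by the cleaning phase.
STOPWORDS = {"fy", "mn"}


def equivalence_classes(tokens, analysis=ANALYSIS, stopwords=STOPWORDS):
    """Group tokens by root or stem and accumulate one count per class."""
    counts = Counter()
    members = defaultdict(set)
    for tok in tokens:
        if tok in stopwords:
            continue  # cleaning phase: drop words that carry no meaning
        # Unknown words fall back to a class containing only themselves.
        key = analysis.get(tok, ("stem", tok))
        counts[key] += 1
        members[key].add(tok)
    return counts, members


tokens = ["ktAb", "fy", "mktbp", "kAtb", "Alkmbywtr",
          "mn", "mdrsp", "kmbywtr", "ktAb"]
counts, members = equivalence_classes(tokens)
top = counts.most_common(2)  # the two most frequent classes
```

In this toy input, no single surface form occurs more than twice, yet the class for the root `ktb` accumulates four occurrences. This is the effect the equivalence classes are meant to capture: frequency is attributed to the underlying root or stem instead of being split across its inflected and derived forms.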
