A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots

We present a clustering algorithm for Arabic words sharing the same root. Root based clusters can substitute dictionaries in indexing for IR. Modifying Adamson and Boreham (1974), our Two-stage algorithm applies light stemming before calculating word pair similarity coefficients using techniques sensitive to Arabic morphology. Tests show a successful treatment of infixes and accurate clustering to up to 94.06% for unedited Arabic text samples, without the use of dictionaries.

[1]  Tarek A. El-Sadany,et al.  An Arabic Morphological System , 1989, IBM Syst. J..

[2]  Donna K. Harman,et al.  How effective is suffixing? , 1991, J. Am. Soc. Inf. Sci..

[3]  Martha W. Evens,et al.  Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System , 1994, J. Am. Soc. Inf. Sci..

[4]  Martha W. Evens,et al.  Stemming methodologies over individual query words for an Arabic information retrieval system , 1999 .

[5]  Kenneth R. Beesley Arabic Finite-State Morphological Analysis and Generation , 1996, COLING.

[6]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[7]  George W. Adamson,et al.  The use of an association measure based on character structure to identify semantically related pairs of words and document titles , 1974, Inf. Storage Retr..

[8]  Kevin Knight,et al.  Translating Names and Technical Terms in Arabic Text , 1998, SEMITIC@COLING.

[9]  Martha W. Evens,et al.  Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System , 1999, J. Am. Soc. Inf. Sci..

[10]  Riyad Al-Shalabi,et al.  A Computational Morphology System for Arabic , 1998, SEMITIC@COLING.

[11]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[12]  Peter Willett,et al.  The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data , 1992, J. Am. Soc. Inf. Sci..

[13]  Elizabeth Shaw Adams A study of trigrams and their feasibility as index terms in a full text information retrieval system , 1992 .

[14]  Ismail Hmeidi,et al.  Design and Implementation of Automatic Indexing for Information Retrieval with Arabic Documents , 1997, J. Am. Soc. Inf. Sci..

[15]  George Anton Kiraz Multi-Tape Two-Level Morphology: A Case Study in Semitic Non-linear Morphology , 1994, COLING.

[16]  Julie Beth Lovins,et al.  Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.

[17]  Peter Willett,et al.  Searching for historical word-forms in a database of 17th-century English text using spelling-correction methods , 1992, SIGIR '92.