Enhancing retrieval effectiveness of diacritisized Arabic passages using stemmer and thesaurus

In this paper we discuss the enhancement of Arabic passage retrieval for both diacritisized and non-diacritisized text. Most previous work suggests that retrieval should start by pre-processing the Arabic text to remove the diacritical marks (short vowels) and so unify the text. In most cases, this process causes considerable ambiguity at the word level in the absence of context. On the other hand, searching for a word in diacritisized text requires typing and matching all of its diacritical marks, which is cumbersome and keeps users from searching for, and hence retrieving, a valuable amount of text. The alternative is to ignore these marks and accept the resulting ambiguity. We propose a passage retrieval approach that searches both diacritisized and diacritic-less text by expanding the user's query so that it can match either form. For this purpose we applied a rule-based stemmer and compiled a large thesaurus. We tested our approach on the text of the Quran, as an openly available source of diacritisized text, using a set of 40 non-diacritical words obtained from testers. We present the results and discuss the future directions that the approach suggests for search engines.
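The trade-off described above can be illustrated with a minimal sketch. It is not the stemmer-and-thesaurus method of the paper; it only shows diacritic-insensitive word matching, assuming the short vowels and related marks are the Unicode code points U+064B to U+0652 plus U+0670. The function names `strip_diacritics` and `find_passages` and the sample words are hypothetical.

```python
import re

# Assumption: the diacritical marks of interest are the Arabic harakat
# U+064B..U+0652 plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the assumed set of Arabic diacritical marks removed."""
    return DIACRITICS.sub("", text)


def find_passages(query: str, passages: list[str]) -> list[str]:
    """Return passages containing a word equal to the query once diacritics are ignored."""
    key = strip_diacritics(query)
    return [p for p in passages if key in strip_diacritics(p).split()]


# Hypothetical sample data, written as escapes so the marks are explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"
QALAM = "\u0642\u064E\u0644\u064E\u0645"         # qalam, "pen"
BARE_KTB = "\u0643\u062A\u0628"                  # k-t-b typed without diacritics

matches = find_passages(BARE_KTB, [KATABA, KUTUB, QALAM])
```

Here the undiacritized query matches both the "he wrote" and the "books" passages, which is the word-level ambiguity mentioned above; an approach based on query expansion would need further evidence, such as stems or thesaurus entries, to tell them apart.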
