A Rule-Based Extensible Stemmer for Information Retrieval with Application to Arabic

This paper presents a new, extensible method for information retrieval and content analysis in natural languages (NL). The method is stem-based: stems are extracted by applying a set of language-dependent rules that are interpreted by a rule engine. Because the rules are kept separate from the engine that interprets them, the system can be adapted to any natural language by modifying the NL semantic rules and grammar. The system has been fully tested on Arabic and partially tested on English, Hebrew, and Persian. We validate our approach using a database-backed prototype.
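To make the idea of rules interpreted by a rule engine concrete, the following minimal Python sketch treats affix-stripping rules as plain data and applies them with a small generic interpreter. It is an illustration under stated assumptions, not the paper's actual rule language or rule set: the names `Rule`, `RULES`, `apply_rules`, and `stem`, the `min_stem` length guard, and the handful of English and Arabic affixes are all hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One affix-stripping rule (hypothetical format, not the paper's)."""
    kind: str      # "prefix" or "suffix"
    affix: str     # non-empty affix to strip
    min_stem: int  # minimum number of characters that must remain


# Illustrative per-language rule tables. Supporting another language
# means adding a table here; the engine below stays unchanged.
RULES = {
    "en": [
        Rule("suffix", "ing", 3),
        Rule("suffix", "ed", 3),
        Rule("suffix", "s", 3),
    ],
    "ar": [
        Rule("prefix", "ال", 3),  # definite article
        Rule("suffix", "ون", 3),  # masculine plural ending
        Rule("suffix", "ات", 3),  # feminine plural ending
    ],
}


def apply_rules(word: str, rules: list[Rule]) -> str:
    """Apply each rule at most once, in order, and return the result."""
    stem = word
    for rule in rules:
        # Skip a rule that would leave fewer than min_stem characters.
        if len(stem) - len(rule.affix) < rule.min_stem:
            continue
        if rule.kind == "prefix" and stem.startswith(rule.affix):
            stem = stem[len(rule.affix):]
        elif rule.kind == "suffix" and stem.endswith(rule.affix):
            stem = stem[:-len(rule.affix)]
    return stem


def stem(word: str, lang: str) -> str:
    """Look up the rule table for `lang` and run the engine on `word`."""
    return apply_rules(word, RULES[lang])
```

With these assumed tables, `stem("walking", "en")` yields `"walk"`, `stem("sing", "en")` is left unchanged by the length guard, and `stem("المعلمون", "ar")` yields `"معلم"`. The design point the sketch tries to capture is that the engine contains no language-specific logic, so changing language is a matter of changing data.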
