Using Stemming in Morphological Analysis to Improve Arabic Information Retrieval

Information retrieval (IR) consists in finding all relevant documents for a user query in a collection of documents. These documents are ordered by the probability of being relevant to the user’s query. The highest ranked document is considered to be the most likely relevant document. Natural Language Processing (NLP) for IR aims to transform the potentially ambiguous words of queries and documents into unambiguous internal representations on which matching and retrieval can take place. This transformation is generally achieved by several levels of linguistic analysis, morphological, syntactic and so forth. In this paper, we present the Arabic linguistic analyzer used in the LIC2M cross-lingual search engine. We focus on the morphological analyzer and particularly the clitic stemmer which segments the input words into proclitics, simple forms and enclitics. We demonstrate that stemming improves search engine recall and precision.