A web statistics based conflation approach to improve Arabic text retrieval

We present a language-independent approach to conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method enhances the pure n-gram model and groups related words using a revised string-similarity measure. This grouping can also produce terms that are most likely not relevant to the query ("noisy terms"). To detect and eliminate them, we propose an approach based on mutual information scores computed from web co-occurrence statistics. We also present an evaluation of this approach.
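The two stages described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the revised string-similarity measure, so the plain Dice coefficient over character bigrams (the pure n-gram baseline) stands in for it, and pointwise mutual information over co-occurrence counts stands in for the web-based score. All function names, thresholds, and counts below are hypothetical, and in practice the counts would come from web hit statistics.

```python
import math


def ngrams(word, n=2):
    """Return the set of character n-grams of a word (works on any Unicode string)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over n-gram sets: the pure n-gram similarity baseline.

    Assumption: the paper's revised measure is not given in the abstract,
    so this plain version is used as a stand-in.
    """
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw co-occurrence counts."""
    if min(count_xy, count_x, count_y) <= 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))


def conflate(query, vocabulary, sim_threshold=0.5):
    """Stage 1: group vocabulary words whose n-gram similarity to the query is high."""
    return [w for w in vocabulary
            if w != query and dice(query, w) >= sim_threshold]


def filter_noisy(query, candidates, counts, pair_counts, total, pmi_threshold=0.0):
    """Stage 2: drop candidates that rarely co-occur with the query.

    counts and pair_counts are hypothetical stand-ins for web statistics.
    """
    kept = []
    for w in candidates:
        score = pmi(pair_counts.get((query, w), 0), counts[query], counts[w], total)
        if score > pmi_threshold:
            kept.append(w)
    return kept


if __name__ == "__main__":
    vocabulary = ["retrieve", "trivial", "banana"]
    candidates = conflate("retrieval", vocabulary, sim_threshold=0.4)
    # Made-up counts for illustration only.
    counts = {"retrieval": 1000, "retrieve": 800, "trivial": 900}
    pair_counts = {("retrieval", "retrieve"): 400, ("retrieval", "trivial"): 1}
    print(candidates)
    print(filter_noisy("retrieval", candidates, counts, pair_counts, total=100_000))
```

With these made-up counts, a string-similar but unrelated word passes the first stage and is then removed by the co-occurrence filter, which is the role the abstract assigns to the mutual information step.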
