Comparison of different lemmatization approaches for information retrieval on Turkish text collection

In this paper, we compare the performance of different lemmatization approaches for information retrieval (IR) over a Turkish text collection. A lemma is the "dictionary form" of a word, and lemmatization is the process of determining the lemma of a given word so that the different inflected forms of that word can be analyzed as a single item. We compared three lemmatizers and one fixed-length truncation approach. The first lemmatizer is based on a morphological analyzer for Turkish built with finite-state language processing technology; the second is the Dictionary-based Turkish Lemmatizer (DTL), which uses a radix-trie data structure; the third is a simple dictionary-based top-down parser. The fourth approach truncates words at a fixed length. We assessed the approaches on the Bilkent University Milliyet collection, which contains more than 400K documents. We compared performance using well-known IR evaluation metrics, with the experiments run in an IR system. The results show that lemmatization improves IR performance. The best results were obtained with DTL, the radix-trie-based lemmatizer, which also used the fewest terms in the IR system.
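To make the two simplest ideas above concrete, the following Python sketch shows fixed-length truncation next to a dictionary lookup over a trie. It is an illustration under stated assumptions, not the DTL implementation: the abstract does not describe DTL's algorithm, so the longest-prefix matching rule, the uncompressed trie (a radix trie would merge single-child chains), the names `LemmaTrie`, `truncate`, and `lemmatize`, the default length of 5, and the three-word dictionary are all hypothetical.

```python
from typing import Dict, Iterable, Optional


class TrieNode:
    """One character position in the dictionary trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_lemma = False  # True if the path from the root spells a lemma


class LemmaTrie:
    """Plain (uncompressed) trie over dictionary lemmas. Hypothetical helper."""

    def __init__(self, lemmas: Iterable[str]) -> None:
        self.root = TrieNode()
        for lemma in lemmas:
            self.insert(lemma)

    def insert(self, lemma: str) -> None:
        node = self.root
        for ch in lemma:
            node = node.children.setdefault(ch, TrieNode())
        node.is_lemma = True

    def longest_prefix(self, word: str) -> Optional[str]:
        """Return the longest dictionary lemma that is a prefix of word."""
        node = self.root
        best = None
        for i, ch in enumerate(word):
            node = node.children.get(ch)
            if node is None:
                break  # no dictionary entry continues with this character
            if node.is_lemma:
                best = word[: i + 1]
        return best


def truncate(word: str, length: int = 5) -> str:
    """Fixed-length truncation; the length of 5 is an assumed parameter."""
    return word[:length]


def lemmatize(word: str, trie: LemmaTrie, length: int = 5) -> str:
    """Assumed policy: dictionary lookup first, truncation as a fallback."""
    lemma = trie.longest_prefix(word)
    return lemma if lemma is not None else truncate(word, length)


# Toy dictionary, for illustration only.
trie = LemmaTrie(["ev", "kitap", "göz"])
print(lemmatize("evlerimizde", trie))  # dictionary hit
print(lemmatize("arabalar", trie))     # no hit, falls back to truncation
```

Prefix matching suits a suffixing language such as Turkish only roughly: a form like "kitabı", where the stem-final consonant changes, would miss the entry "kitap" in this sketch and fall back to truncation. Handling such alternations is the kind of work a morphological analyzer does.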
