Truncation of Content Terms for Turkish

Stemming, truncation, suffix stripping, and decompounding algorithms are widely used in information retrieval (IR) to reduce content terms to their conflated forms. They are known to improve retrieval performance and to save both storage space and processing time. In this paper we investigate the statistical characteristics of truncated terms for Turkish on a text corpus of more than 50 million words and measure the vocabulary growth rates of both whole and truncated words. Our findings indicate that truncated words in Turkish exhibit Zipfian behavior and that whole words can be truncated to the average word length (6.2 characters) without compromising retrieval effectiveness. The vocabulary growth rate for truncated words is about one third of that for whole words. The implications of our study are twofold. First, it makes a case for truncating content terms in Turkish, a language for which there is no publicly available stemming code equipped with morphological analysis capability. Second, a truncation algorithm for indexing Turkish text may yield effectiveness comparable to that of a stemming algorithm, and hence the need for stemming may become obsolete, given that morphological analyzers for Turkish are highly complex.
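As a rough illustration of the two quantities discussed above, the following minimal sketch truncates terms to a fixed prefix length and tracks how the number of distinct terms grows as tokens are read. It is not the code used in the study: the function names `truncate` and `vocabulary_growth`, the cutoff of 6 characters (the reported 6.2-character average rounded down), and the toy token list are all assumptions made for illustration, and the tokens are assumed to be already lowercased and tokenized.

```python
from typing import Iterable, List, Optional, Tuple


def truncate(term: str, length: int = 6) -> str:
    """Keep only the first `length` characters of a term (hypothetical helper)."""
    return term[:length]


def vocabulary_growth(
    tokens: Iterable[str], length: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Return (tokens read so far, distinct terms so far) after each token.

    If `length` is given, each token is truncated to that many characters
    before being counted; otherwise whole words are counted.
    """
    seen = set()
    growth = []
    for position, token in enumerate(tokens, start=1):
        key = token if length is None else truncate(token, length)
        seen.add(key)
        growth.append((position, len(seen)))
    return growth


# Toy sample (an assumption, not corpus data): inflected forms of
# "kitap" (book) and "ev" (house).
sample = ["kitap", "kitaplar", "kitaplarda", "kitaplardan",
          "evler", "evlerde", "evlerden"]

whole_growth = vocabulary_growth(sample)
truncated_growth = vocabulary_growth(sample, length=6)
```

On this toy sample the whole-word vocabulary ends at 7 distinct terms while the truncated vocabulary ends at 4, because several inflected forms share the same six-character prefix. The ratio here says nothing about the one-third figure reported for the actual corpus; it only shows the mechanism by which truncation conflates suffixed forms in an agglutinative language.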
