FindStem: Analysis and Evaluation of a Turkish Stemming Algorithm

In this paper, we evaluate the effectiveness of a new stemming algorithm, FINDSTEM, for Turkish documents and queries, and compare it with two previously defined Turkish stemmers, the "A-F" and "L-M" algorithms. FINDSTEM and A-F apply both inflectional and derivational rules, whereas L-M handles only inflectional rules. The stemmers were compared manually on 5,000 distinct words, on which FINDSTEM, A-F, and L-M failed in 49, 270, and 559 cases, respectively (0.98%, 5.40%, and 11.18% of the words). To measure the retrieval effectiveness of FINDSTEM, we used a medium-size collection comprising 2,468 law records with 280K document words, 15 natural-language queries with an average length of 17 search words, and complete relevance information for each query. We localized the SMART retrieval system by adding a Turkish stop list, support for Turkish characters through the ISO 8859-9 (Latin-5) code set, a stemming algorithm (FINDSTEM), and Turkish translations of the system messages. Our results, based on average precision values at 11 recall points, show that indexing both document and search terms with FINDSTEM stemming is clearly and consistently more effective than indexing the terms as they are (that is, with no stemming at all).
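
The evaluation measure mentioned above can be illustrated with a minimal sketch. It assumes the conventional interpolated form of 11-point average precision (precision at recall 0.0, 0.1, ..., 1.0, where each level takes the highest precision observed at that recall or beyond); the abstract does not state the exact interpolation rule, so this is an assumption rather than a description of the SMART implementation. The function names and the document identifiers are hypothetical.

```python
from typing import List, Sequence, Set


def eleven_point_precision(ranking: Sequence[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for a single query."""
    if not relevant:
        raise ValueError("query has no relevant documents")

    # Record (recall, precision) each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    levels = [i / 10 for i in range(11)]
    # Interpolation (assumed): precision at level r is the maximum precision
    # at any observed recall >= r, or 0.0 if that recall is never reached.
    eps = 1e-12  # guards against floating-point noise in the comparison
    return [
        max((p for rec, p in points if rec >= r - eps), default=0.0)
        for r in levels
    ]


def eleven_point_average(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the 11 interpolated precision values for a single query."""
    values = eleven_point_precision(ranking, relevant)
    return sum(values) / len(values)


# Hypothetical ranked output for one query; d1 and d3 are the relevant records.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3"}
score = eleven_point_average(ranking, relevant)
```

To compare a stemmed run against an unstemmed one in this way, the same function would be applied to both rankings for each query and the per-query scores averaged over all queries.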