Corpus-Specific Stemming using Word Form Co-occurrence

Stemming is used in many information retrieval (IR) systems to reduce word forms to common roots. It is one of the simplest and most successful applications of natural language processing to IR. Current stemming algorithms, however, are either inflexible or difficult to adapt to the specific characteristics of a text corpus, except through manually defined exception lists. We propose a technique that uses corpus-based word co-occurrence statistics to modify a stemmer. Experiments show that this technique is effective and well suited to query-based stemming.
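
To make the general idea concrete, the following is a minimal sketch rather than the method evaluated in the paper. It assumes a deliberately crude stand-in stemmer (`crude_stem`, which conflates words sharing their first four letters) and a simple co-occurrence ratio with an arbitrary threshold; the function names, the window size, and the scoring formula are all illustrative assumptions. Words that the stemmer conflates stay in the same class only if they also co-occur in the corpus.

```python
from collections import defaultdict
from itertools import combinations


def crude_stem(word, k=4):
    # Hypothetical stand-in for an aggressive stemmer: keep the first k letters.
    return word[:k]


def cooccurrence_counts(docs, window=10):
    # Count occurrences of each word, and co-occurrences of distinct words
    # that appear within `window` tokens of each other.
    occ = defaultdict(int)
    co = defaultdict(int)
    for doc in docs:
        tokens = doc.lower().split()
        for i, a in enumerate(tokens):
            occ[a] += 1
            for b in tokens[i + 1 : i + 1 + window]:
                if a != b:
                    co[frozenset((a, b))] += 1
    return occ, co


def refine_classes(docs, threshold=0.1, window=10):
    # Start from the stemmer's equivalence classes, then keep two words
    # together only if their co-occurrence score reaches the threshold.
    occ, co = cooccurrence_counts(docs, window)
    classes = defaultdict(set)
    for w in occ:
        classes[crude_stem(w)].add(w)

    parent = {w: w for w in occ}  # union-find over the vocabulary

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for members in classes.values():
        for a, b in combinations(sorted(members), 2):
            # Assumed score: co-occurrence count normalised by total frequency.
            score = co[frozenset((a, b))] / (occ[a] + occ[b])
            if score >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for w in occ:
        groups[find(w)].add(w)
    return list(groups.values())


docs = [
    "the policy makers debated the new policy and other policies",
    "police arrested the suspect while more police arrived",
]
for group in refine_classes(docs):
    if len(group) > 1:
        print(sorted(group))
```

In this toy corpus the crude stemmer would conflate "policy", "policies", and "police", but only the first two co-occur, so "police" ends up in its own class. The same refinement could in principle be restricted to the word forms related to a query's terms, which is one way to read the query-based use mentioned above.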