Details on Stemming in the Language Modeling Framework

We incorporate stemming into the language modeling framework. The work is motivated by the observation that stemming increases the number of word occurrences used to estimate the probability of a word, because occurrences of the other members of its stem class are counted as well. Stemming can therefore be viewed as a form of smoothing of probability estimates. We show that this view leads to a simple way of incorporating ideas from corpus-based stemming. We also present two generative models of stemming. The first generates terms and then variant stems; the second generates a stem class and then a member of that class. All models are evaluated empirically, though we find little difference between the various forms of stemming.
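
To make the smoothing view concrete, the following is a minimal sketch and not the models evaluated in this work. It assumes a document unigram model in which the maximum-likelihood estimate of a word is interpolated with a class-based estimate that generates a stem class and then picks a member uniformly. The stem classes, the uniform member choice, the interpolation weight `lam`, and all function names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical stem classes; each word is assumed to belong to exactly one class.
STEM_CLASSES = [
    {"run", "runs", "running"},
    {"model", "models"},
]


def build_class_index(stem_classes):
    """Map each word to the (frozen) stem class that contains it."""
    index = {}
    for cls in stem_classes:
        frozen = frozenset(cls)
        for word in frozen:
            index[word] = frozen
    return index


def smoothed_prob(word, doc_tokens, class_index, lam=0.5):
    """Interpolate the word's own estimate with a stem-class estimate.

    lam is an assumed mixing weight: lam=0 gives the unstemmed
    maximum-likelihood estimate, lam=1 relies on the stem class alone.
    """
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    if doc_len == 0:
        return 0.0
    # A word with no known class forms a singleton class.
    members = class_index.get(word, frozenset([word]))
    p_mle = counts[word] / doc_len
    # Occurrences of every class member contribute to the class count.
    class_count = sum(counts[m] for m in members)
    # Generate the class, then a member uniformly (an assumption).
    p_class = class_count / doc_len / len(members)
    return (1 - lam) * p_mle + lam * p_class


class_index = build_class_index(STEM_CLASSES)
doc = ["run", "running", "running", "model"]
# "runs" never occurs in doc, yet receives mass from "run" and "running".
p_runs = smoothed_prob("runs", doc, class_index)
```

In this sketch a word that never occurs in the document still receives nonzero probability when other members of its stem class occur, which is the sense in which stemming acts as smoothing. Because each word belongs to exactly one class, the estimates still sum to one over the vocabulary.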