Word N-gram Probability Calculation from a Stochastically Segmented Corpus

Statistical language modeling plays an important role in state-of-the-art language processing systems such as speech recognizers and spelling checkers. The most popular language model (LM) is the word n-gram model, which requires sentences annotated with word boundary information. In many Asian languages, however, words are not delimited by whitespace, so building a statistically reliable large corpus requires annotating sentences with word boundary information. In this paper, we present the concept of a stochastically segmented corpus, which consists of a raw corpus and word boundary probabilities, and a method for calculating word n-gram probabilities from such a corpus. In an experiment, we applied our method to an LM adaptation problem, and it showed an advantage over an existing method.
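To make the core idea concrete, the following is a minimal sketch (not the paper's actual algorithm, whose details are given later) of how an expected word count can be computed from a stochastically segmented corpus. It assumes the natural reading of the definition above: the corpus is a character string plus a probability, at each inter-character position, that a word boundary exists there; a word's expected frequency is then the sum, over all matching character spans, of the probability that the span is segmented out as a single word. The function name `expected_count` and the toy data are illustrative assumptions.

```python
from math import prod

def expected_count(corpus: str, boundary_prob: list[float], word: str) -> float:
    """Expected frequency of `word` in a stochastically segmented corpus.

    boundary_prob[i] is the probability of a word boundary between
    corpus[i-1] and corpus[i]; positions 0 and len(corpus) are the
    corpus edges and should therefore be 1.0.
    """
    n, k = len(corpus), len(word)
    total = 0.0
    for i in range(n - k + 1):
        if corpus[i:i + k] != word:
            continue
        # Probability that no boundary falls inside the matched span.
        inside = prod(1.0 - boundary_prob[j] for j in range(i + 1, i + k))
        # Span counts as one word iff boundaries occur at both its edges.
        total += boundary_prob[i] * inside * boundary_prob[i + k]
    return total

# Toy corpus of 4 characters with 5 gap positions (edges fixed at 1.0).
print(expected_count("abab", [1.0, 0.8, 0.5, 0.8, 1.0], "ab"))  # ~0.2
```

Summing such expected counts over a raw corpus, in place of the integer counts of a hand-segmented corpus, is what allows word n-gram probabilities to be estimated by relative frequencies without committing to a single segmentation.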