Statistical language modeling plays an important role in state-of-the-art language processing systems such as speech recognizers and spelling checkers. The most popular language model (LM) is the word n-gram model, which requires sentences annotated with word boundary information. In many Asian languages, however, words are not delimited by whitespace, so preparing a statistically reliable large corpus requires annotating sentences with word boundaries. In this paper, we present the concept of a stochastically segmented corpus, which consists of a raw corpus and word boundary probabilities, and a method for calculating word n-gram probabilities from such a corpus. In our experiment, the method is applied to an LM adaptation problem and shows an advantage over an existing method.
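The core idea can be illustrated with a minimal sketch: given per-position word boundary probabilities over a raw character sequence, the expected count of a word is the sum, over all its surface occurrences, of the probability that a boundary falls at both ends and nowhere inside. The function name, data layout, and independence assumption between boundary positions here are illustrative simplifications, not the paper's exact estimator.

```python
def expected_word_count(chars, boundary_prob, word):
    """Expected count of `word` in a stochastically segmented corpus.

    chars          : the raw (unsegmented) character sequence.
    boundary_prob  : boundary_prob[i] is the probability of a word
                     boundary between chars[i-1] and chars[i];
                     positions 0 and len(chars) are corpus edges
                     and should be 1.0.
    Assumes boundary positions are independent (a simplification).
    """
    n, m = len(chars), len(word)
    total = 0.0
    for i in range(n - m + 1):
        if chars[i:i + m] != word:
            continue
        # Boundaries must occur at both ends of the occurrence...
        p = boundary_prob[i] * boundary_prob[i + m]
        # ...and at none of the positions inside it.
        for k in range(i + 1, i + m):
            p *= 1.0 - boundary_prob[k]
        total += p
    return total


# Toy usage: "ab" occurs twice in "abab"; each occurrence
# contributes its probability of being segmented out as a word.
count = expected_word_count("abab", [1.0, 0.2, 0.8, 0.3, 1.0], "ab")
```

Expected word counts of this form can then be normalized to yield word n-gram probabilities, with the word boundary probabilities themselves supplied by a probabilistic segmenter run over the raw corpus.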