We present a self-organized method for building a stochastic Japanese word segmenter from a small number of basic words and a large amount of unsegmented training text. It consists of a word-based statistical language model, an initial estimation procedure, and a re-estimation procedure. Initial word frequencies are estimated by counting all possible longest match strings between the training text and the word list. The initial word list is augmented by identifying words in the training text using a heuristic rule based on character type. The word-based language model is then re-estimated to filter out inappropriate word hypotheses generated by the initial word identification. When the word segmenter is trained on 3.9M character texts and 1719 initial words, its word segmentation accuracy is 86.3% recall and 82.5% precision. We find that the combination of heuristic word identification and re-estimation is so effective that the initial word list need not be large.

1 Introduction

Word segmentation is an important problem for Japanese because word boundaries are not marked in its writing system. Other Asian languages such as Chinese and Thai have the same problem. Any Japanese NLP application requires word segmentation as the first stage because there are phonological and semantic units whose pronunciation and meaning are not trivially derivable from those of the individual characters. Once word segmentation is done, all established techniques can be exploited to build practically important applications such as spelling correction [Nagata, 1996] and text retrieval [Nie and Brisebois, 1996].

In a sense, Japanese word segmentation is a solved problem if (and only if) we have plenty of segmented training text. Around 95% word segmentation accuracy is reported by using a word-based language model and a Viterbi-like dynamic programming procedure [Nagata, 1994, Takeuchi and Matsumoto, 1995, Yamamoto, 1996]. However, manually segmented corpora are not always available in a particular target domain, and manual segmentation is very expensive.

The goal of our research is unsupervised learning of Japanese word segmentation: that is, to build a Japanese word segmenter from a list of initial words and unsegmented training text. Today, it is easy to obtain a 10K-100K word list from either commercial or public domain on-line Japanese dictionaries, and gigabytes of Japanese text are readily available from newspapers, patents, HTML documents, etc.

Few works have examined unsupervised word segmentation in Japanese. Both [Yamamoto, 1996] and [Takeuchi and Matsumoto, 1995] built a word-based language model from unsegmented text
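As a concrete illustration of the word-based language model and the Viterbi-like dynamic programming procedure mentioned above, the following is a minimal sketch in Python of unigram word segmentation. It is our own illustration, not the paper's implementation: the function names, the maximum word length, and the fallback behavior for uncoverable input are assumptions.

import math

def viterbi_segment(sentence, word_freq, total_count, max_word_len=8):
    """Segment `sentence` into the word sequence that maximizes the
    product of unigram word probabilities, via dynamic programming.
    `word_freq` maps known words to (possibly fractional) counts."""
    n = len(sentence)
    # best[i] = (max log-probability of segmenting sentence[:i], backpointer)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word not in word_freq:
                continue
            logp = best[j][0] + math.log(word_freq[word] / total_count)
            if logp > best[i][0]:
                best[i] = (logp, j)
    # Recover the word sequence from the backpointers. If the input
    # cannot be covered by known words, this falls back to returning
    # the whole string; a real system would use an unknown-word model.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(sentence[j:i])
        i = j
    return list(reversed(words))

For example, with word_freq = {"東京": 2.0, "都": 1.0, "に": 5.0, "住む": 1.0} and total_count = 9.0, viterbi_segment("東京都に住む", word_freq, 9.0) returns ['東京', '都', 'に', '住む']. In the paper's setting, the (possibly fractional) counts in word_freq would come from the initial longest-match estimation and would then be refined by re-estimation.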
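The abstract also mentions identifying new words with a heuristic rule based on character type. As a rough sketch of that idea (the classification below and the choice to take maximal kanji and katakana runs as candidates are our assumptions, not necessarily the paper's exact rule), one can classify each character by its Unicode name and emit maximal same-type runs as new word hypotheses, which re-estimation later filters:

import unicodedata

def char_type(ch):
    """Classify a Japanese character by script using its Unicode name.
    (Assumed classification; the paper's categories may differ.)"""
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "KATAKANA" in name:
        return "katakana"
    if "HIRAGANA" in name:
        return "hiragana"
    return "other"

def candidate_words(text, types=("kanji", "katakana")):
    """Return maximal same-type character runs as new word candidates."""
    candidates, start = [], 0
    for i in range(1, len(text) + 1):
        if i == len(text) or char_type(text[i]) != char_type(text[start]):
            if char_type(text[start]) in types:
                candidates.append(text[start:i])
            start = i
    return candidates

For example, candidate_words("データを解析する") yields ['データ', '解析'].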
[1] Chilin Shih, et al. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. ACL, 1994.
[2] Masaaki Nagata, et al. A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm. COLING, 1994.
[3] Jian-Yun Nie, et al. On Chinese text retrieval. SIGIR '96, 1996.
[4] Xiaoqiang Luo, et al. An Iterative Algorithm to Build Chinese Language Models. ACL, 1996.
[5] David Elworthy, et al. Does Baum-Welch Re-estimation Help Taggers? ANLP, 1994.
[6] Zimin Wu, et al. Chinese Text Segmentation for Text Retrieval: Achievements and Problems. J. Am. Soc. Inf. Sci., 1993.
[7] Masaaki Nagata. Context-Based Spelling Correction for Japanese OCR. COLING, 1996.
[8] Keh-Yih Su, et al. Automatic Construction of a Chinese Electronic Dictionary. VLC@ACL, 1995.
[9] Mikio Yamamoto, et al. A Re-estimation Method for Stochastic Language Modeling from Ambiguous Observations. VLC@COLING, 1996.
[10] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches. SODA '90, 1993.
[11] Julian M. Kupiec, et al. Robust part-of-speech tagging using a hidden Markov model. 1992.