A Trainable Rule-based Algorithm for Word Segmentation

This paper presents a trainable rule-based algorithm for performing word segmentation. The algorithm provides a simple, language-independent alternative to large-scale lexicon-based segmenters, which require large amounts of knowledge engineering. As a stand-alone segmenter, we show our algorithm to produce high-performance Chinese segmentation. In addition, we show the transformation-based algorithm to be effective in improving the output of several existing word segmentation algorithms in three different languages.

1 Introduction

This paper presents a trainable rule-based algorithm for performing word segmentation. Our algorithm is effective both as a high-accuracy stand-alone segmenter and as a postprocessor that improves the output of existing word segmentation algorithms; a schematic sketch of this kind of rule-driven boundary editing appears at the end of this section.

In the writing systems of many languages, including Chinese, Japanese, and Thai, words are not delimited by spaces. Determining the word boundaries, and thus tokenizing the text, is usually one of the first necessary processing steps, making tasks such as part-of-speech tagging and parsing possible. A variety of methods have recently been developed to perform word segmentation, and the results have been published widely. (Most published segmentation work has been done for Chinese; for a discussion of recent Chinese segmentation work, see Sproat et al. (1996).)

A major difficulty in evaluating segmentation algorithms is that there are no widely accepted guidelines as to what constitutes a word, and there is therefore no agreement on how to "correctly" segment a text in an unsegmented language. It is frequently mentioned in segmentation papers that native speakers of a language do not always agree about the "correct" segmentation, and that the same text could be segmented into several very different (and equally correct) sets of words by different native speakers. Sproat et al. (1996) and Wu and Fung (1994) give empirical results showing that an agreement rate between native speakers as low as 75% is common. Consequently, an algorithm that scores extremely well against one native segmentation may score dismally against other, equally "correct" segmentations. We discuss other issues in evaluating word segmentation in Section 3.1.

One solution to the problem of multiple correct segmentations might be to establish specific guidelines for what is and is not a word in unsegmented languages. Given such guidelines, all corpora could in principle be segmented uniformly according to the same conventions, and existing methods could be compared directly on the same corpora. While this approach has been successful in driving progress in NLP tasks such as part-of-speech tagging and parsing, there are valid arguments against adopting it for word segmentation. For example, since word segmentation is merely a preprocessing step for a wide variety of further tasks, such as parsing, information extraction, and information retrieval, different segmentations can be useful or even essential for different tasks. In this sense, word segmentation is similar to speech recognition, in which a system must be robust enough to adapt to and recognize the multiple speaker-dependent "correct" pronunciations of words. In some cases, it may also be necessary to allow multiple "correct" segmentations of the same text, depending on the requirements of further processing steps.
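To make the boundary-editing idea concrete before the formal presentation, the following is a minimal sketch of how an ordered list of learned rules might be applied to an initial segmentation. The Rule format, the insert/delete actions keyed to the two characters flanking a candidate boundary, and all names below are illustrative assumptions for exposition only; the actual rule templates and the training procedure used by the algorithm are presented later in the paper.

    # A minimal sketch of rule-driven boundary editing, assuming a
    # hypothetical two-character trigger format; the paper's actual rule
    # templates differ and are described in later sections.
    from dataclasses import dataclass

    @dataclass
    class Rule:
        """Insert or delete a word boundary when the trigger context matches."""
        left: str      # character immediately left of the candidate boundary
        right: str     # character immediately right of the candidate boundary
        action: str    # "insert" or "delete"

    def apply_rules(chars, boundaries, rules):
        """Apply an ordered list of learned rules to an initial segmentation.

        chars      -- the unsegmented text as a list of characters
        boundaries -- indices i such that a word boundary follows chars[i]
        rules      -- rules in the order in which they were learned
        """
        boundaries = set(boundaries)
        for rule in rules:
            for i in range(len(chars) - 1):
                if chars[i] == rule.left and chars[i + 1] == rule.right:
                    if rule.action == "insert":
                        boundaries.add(i)
                    elif rule.action == "delete":
                        boundaries.discard(i)
        return boundaries

    def segment(text, boundaries):
        """Render the text with spaces at the chosen boundary positions."""
        out = []
        for i, ch in enumerate(text):
            out.append(ch)
            if i in boundaries:
                out.append(" ")
        return "".join(out).strip()

    if __name__ == "__main__":
        text = "ABCD"                # stands in for an unsegmented string
        initial = set()              # e.g., start from "no boundaries"
        rules = [Rule(left="B", right="C", action="insert")]
        print(segment(text, apply_rules(list(text), initial, rules)))  # "AB CD"

The appeal of this scheme is that the initial segmentation can come from any source, including a cheap baseline or the output of an existing segmenter, with each rule patching the errors left by the previous state; this is what allows the same mechanism to serve both as a stand-alone segmenter and as a postprocessor.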
In contrast to this need for flexibility, however, many algorithms rely on extensive domain-specific word lists, intricate name-recognition routines, and hard-coded morphological analysis modules to produce a predetermined segmentation output. Modifying or retargeting an