Do We Need Chinese Word Segmentation for Statistical Machine Translation?

In Chinese texts, words are not separated by white spaces. This is problematic for many natural language processing tasks. The standard approach is to segment the Chinese character sequence into words. Here, we investigate Chinese word segmentation for statistical machine translation. We pursue two goals: the first is the maximization of the final translation quality; the second is the minimization of the manual effort for building a translation system. The commonly used method for obtaining word boundaries is based on a word segmentation tool and a predefined monolingual dictionary. To avoid the dependence of the translation system on an external dictionary, we have developed a system that learns a domain-specific dictionary from the parallel training corpus. This method produces results comparable to those obtained with the predefined dictionary. Furthermore, our translation system is able to work without word segmentation, with only a minor loss in translation quality.
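
The abstract mentions learning a domain-specific dictionary from the parallel training corpus. The following is only a minimal sketch of how such a dictionary might be extracted, assuming character-to-word alignments of the bilingual sentence pairs are available (e.g., from a statistical alignment tool such as GIZA++); the function and variable names are illustrative and not taken from the paper.

```python
# Sketch: collect candidate Chinese "words" as maximal runs of consecutive
# characters that are aligned to the same English word, then keep the
# frequent candidates as dictionary entries. (Illustrative only; the paper's
# actual procedure may differ in detail.)
from collections import Counter
from typing import List, Tuple

# One corpus entry: (Chinese character string, English words,
#                    alignment as (char_index, word_index) pairs)
SentencePair = Tuple[str, List[str], List[Tuple[int, int]]]

def extract_dictionary(corpus: List[SentencePair], min_count: int = 2) -> List[str]:
    counts: Counter = Counter()
    for zh_chars, en_words, alignment in corpus:
        # Map each Chinese character position to its aligned English positions.
        links = {}
        for zh_i, en_j in alignment:
            links.setdefault(zh_i, set()).add(en_j)
        # Merge adjacent characters that share an aligned English word.
        i = 0
        while i < len(zh_chars):
            j = i + 1
            while j < len(zh_chars) and links.get(j, set()) & links.get(i, set()):
                j += 1
            candidate = zh_chars[i:j]
            if len(candidate) > 1:   # single characters need no dictionary entry
                counts[candidate] += 1
            i = j
    return [word for word, c in counts.items() if c >= min_count]
```

The learned entries could then be passed to a standard segmentation tool in place of a predefined monolingual dictionary, which is the dependence the abstract aims to remove.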