The Construction of a Dictionary for a Two-layer Chinese Morphological Analyzer

We built a Chinese morphological analyzer that is freely available for research purposes. A practical system requires a dictionary of reasonable size. The initial dictionary, built from the Penn Chinese Treebank corpus v4.0, contains only 33,438 entries. Because this initial dictionary is quite small, we apply unknown word detection methods to a large amount of raw text to extract new words, which are then added to the system dictionary. The resulting dictionary contains 120,769 entries. Finally, we propose a two-layer morphological analyzer that provides two sets of outputs. The first layer produces the minimal segmentation units that we define, and the second layer transforms the output of the first layer into the original segmentation units defined by the Penn Chinese Treebank.
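
The two-layer idea can be illustrated with a minimal sketch. It is not the analyzer described here: the first layer below uses simple greedy forward maximum matching as a stand-in, the second layer uses a hypothetical table of merge rules, and the dictionary entries, the rules, and the function names `segment_minimal` and `merge_units` are all invented for illustration.

```python
# Toy illustration of a two-layer pipeline. The matching strategy, the
# dictionary entries, and the merge rules are assumptions made for this
# sketch; they are not taken from the actual system or its dictionary.

def segment_minimal(text, dictionary, max_len=4):
    """Layer 1 (stand-in): greedy forward maximum matching.

    At each position, take the longest dictionary entry that matches;
    fall back to a single character when nothing matches.
    """
    units = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                units.append(candidate)
                i += size
                break
    return units


def merge_units(units, merge_rules, max_span=3):
    """Layer 2 (stand-in): join adjacent minimal units listed in merge_rules.

    merge_rules is a set of tuples of minimal units that together form one
    larger segmentation unit. Longer spans are tried first.
    """
    merged = []
    i = 0
    while i < len(units):
        for span in range(min(max_span, len(units) - i), 0, -1):
            window = tuple(units[i:i + span])
            if span == 1 or window in merge_rules:
                merged.append("".join(window))
                i += span
                break
    return merged


# Hypothetical data: three minimal units and one merge rule.
TOY_DICTIONARY = {"北京", "大学", "学生"}
TOY_MERGE_RULES = {("北京", "大学")}

layer1 = segment_minimal("北京大学的学生", TOY_DICTIONARY)
layer2 = merge_units(layer1, TOY_MERGE_RULES)
print(layer1)  # ['北京', '大学', '的', '学生']
print(layer2)  # ['北京大学', '的', '学生']
```

Keeping the layers separate means the finer-grained output remains available on its own, while the second pass only has to decide which adjacent units to join.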