A maximum-entropy Chinese parser augmented by transformation-based learning

Parsing, the task of identifying syntactic constituents such as noun and verb phrases in a sentence, is one of the fundamental tasks in natural language processing. Many natural language applications, such as spoken-language understanding, machine translation, and information extraction, would benefit from, or even require, high-accuracy parsing as a preprocessing step. Although most state-of-the-art statistical parsers were initially built for English, they are largely not language-specific, in that they do not rely on properties unique to English. Constructing a parser for another language therefore becomes a matter of retraining the statistical parameters on a treebank in that language. The development of the Chinese Treebank [Xia et al. 2000] spurred the construction of parsers for Chinese. However, Chinese poses some unique problems for the development of a statistical parser, the most apparent being word segmentation. Since words in written Chinese are not delimited as they are in Western languages, the first problem that must be solved before an existing statistical method can be applied to Chinese is identifying word boundaries. This step is sidestepped by most existing Chinese parsers, which assume that the input has already been segmented. This article describes a character-based statistical parser that gives the best performance to date on the Chinese Treebank data. We augment an existing maximum-entropy parser with transformation-based learning, creating a parser that operates at the character level. We present experiments showing that our parser achieves results close to those achievable under perfect word-segmentation conditions.
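To make the transformation-based learning component more concrete, the sketch below shows a minimal TBL loop applied to character-level word segmentation, framed as per-character boundary tagging (B = begins a word, I = inside a word). This is an illustrative assumption, not the paper's actual rule templates, tag set, or feature design: the rule template (previous character, current character, current tag, new tag) and the "every character starts a word" baseline are placeholders chosen only to show how error-driven rule selection works.

```python
# Minimal sketch of transformation-based learning (TBL) for character-level
# Chinese word segmentation. Tags: "B" = character begins a word, "I" = inside.
# Rule templates and baseline are illustrative assumptions, not the paper's.

from collections import Counter


def initial_tags(chars):
    """Baseline annotator: assume every character starts a word."""
    return ["B"] * len(chars)


def apply_rule(chars, tags, rule):
    """rule = (prev_char, cur_char, from_tag, to_tag): retag matching positions."""
    prev_c, cur_c, from_t, to_t = rule
    new_tags = list(tags)
    for i in range(1, len(chars)):
        if chars[i - 1] == prev_c and chars[i] == cur_c and tags[i] == from_t:
            new_tags[i] = to_t
    return new_tags


def learn_rules(corpus, max_rules=10):
    """Greedy TBL loop: repeatedly pick the rule with the largest net error reduction."""
    states = [(chars, initial_tags(chars), gold) for chars, gold in corpus]
    rules = []
    for _ in range(max_rules):
        net = Counter()
        for chars, tags, gold in states:
            for i in range(1, len(chars)):
                if tags[i] != gold[i]:
                    # Applying this candidate here would fix an error.
                    net[(chars[i - 1], chars[i], tags[i], gold[i])] += 1
                else:
                    # Applying the opposite retagging here would introduce an error.
                    other = "I" if tags[i] == "B" else "B"
                    net[(chars[i - 1], chars[i], tags[i], other)] -= 1
        if not net:
            break
        best, score = max(net.items(), key=lambda kv: kv[1])
        if score <= 0:
            break
        rules.append(best)
        states = [(c, apply_rule(c, t, best), g) for c, t, g in states]
    return rules


# Toy usage: "北京大学生" gold-segmented as 北京 / 大学生.
corpus = [(list("北京大学生"), ["B", "I", "B", "I", "I"])]
print(learn_rules(corpus))
```

At decoding time, the learned rules would be applied in order to the baseline tagging of unseen text, and the resulting B/I sequence converted back into word boundaries before (or jointly with, as in the character-based parser described above) syntactic analysis.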

[1] David Chiang, et al. Two Statistical Parsing Models Applied to the Chinese Treebank, 2000, ACL.

[2] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[3] Roger Levy, et al. Is it Harder to Parse Chinese, or the Chinese Treebank?, 2003, ACL.

[4] E. T. Jaynes, et al. Maximum Entropy and Bayesian Methods in Applied Statistics: Monkeys, Kangaroos, and N, 1986.

[5] Pascale Fung, et al. Improving Chinese Tokenization with Linguistic Filters on Statistical Lexical Acquisition, 1994, ANLP.

[6] Ralph Weischedel, et al. A Statistical Parser for Chinese, 2002.

[7] Grace Ngai, et al. Multidimensional Transformation-Based Learning, 2001, CoNLL.

[8] Xiaoqiang Luo. A Maximum Entropy Chinese Character-Based Parser, 2003, EMNLP.

[9] E. T. Jaynes. Monkeys, Kangaroos, and N, 1999.

[10] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-of-Speech Tagging, 1996, EMNLP.

[11] Mitchell P. Marcus, et al. Text Chunking Using Transformation-Based Learning, 1995, VLC@ACL.

[12] David D. Palmer, et al. A Trainable Rule-Based Algorithm for Word Segmentation, 1997, ACL.

[13] Mitchell P. Marcus, et al. Maximum Entropy Models for Natural Language Ambiguity Resolution, 1998.

[14] Christopher D. Manning, et al. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger, 2000, EMNLP.

[15] Chilin Shih, et al. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese, 1994, ACL.

[16] Nianwen Xue, et al. Developing Guidelines and Ensuring Consistency for Chinese Text Annotation, 2000, LREC.

[17] Eric Brill, et al. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging, 1995, CL.