Applying Maximum Entropy to Robust Chinese Shallow Parsing

Shallow parsing has recently been applied to a range of information processing systems, including information retrieval, information extraction, question answering, and automatic document summarization. Because a shallow parser is much more efficient and less demanding than a full parser, it is well suited to online applications. In this research, we formulate shallow parsing as a sequential tagging problem and use a supervised machine learning technique, Maximum Entropy (ME), to build a Chinese shallow parser. The main features of the ME-based shallow parser are the part-of-speech (POS) tags and the context words in a sentence. We adopt the shallow parsing results of Sinica Treebank as our standard, and select 30,000 sentences from Sinica Treebank as the training set and 10,000 sentences as the test set. We then test the robustness of the shallow parser on noisy data. The experimental results show that the proposed shallow parser is quite robust for sentences containing unknown proper nouns.
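The sequential tagging formulation can be illustrated with a minimal sketch. It assumes details the abstract does not state: an IOB-style chunk label set, a context window of one token on each side, and hand-picked weights in place of trained ME parameters. The function names `extract_features` and `label_distribution`, the example sentence, and its POS tags are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple


def extract_features(words: List[str], pos_tags: List[str], i: int,
                     window: int = 1) -> List[str]:
    """Collect word and POS features around position i (assumed window size)."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(words):
            w, p = words[j], pos_tags[j]
        else:
            # Positions outside the sentence get a padding symbol.
            w, p = "<PAD>", "<PAD>"
        feats.append(f"w[{offset}]={w}")
        feats.append(f"pos[{offset}]={p}")
    return feats


def label_distribution(feats: List[str],
                       weights: Dict[Tuple[str, str], float],
                       labels: List[str]) -> Dict[str, float]:
    """ME (log-linear) model: p(y | x) is proportional to
    exp(sum of the weights of the active (feature, label) pairs)."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: v / z for y, v in exps.items()}


# Illustrative input: a three-word sentence with assumed POS tags.
words = ["他", "喜歡", "音樂"]
pos_tags = ["Nh", "VK", "Na"]
labels = ["B-NP", "I-NP", "B-VP", "O"]  # assumed IOB-style label set

# Hand-picked weights standing in for parameters learned from a treebank.
weights = {("pos[0]=Nh", "B-NP"): 2.0, ("pos[0]=VK", "B-VP"): 2.0}

feats = extract_features(words, pos_tags, 0)
dist = label_distribution(feats, weights, labels)
best = max(dist, key=dist.get)
```

In a trained parser the weights would be estimated from the training sentences, and a chunk label would be predicted for every token in turn. The sketch only shows how POS and context-word features feed a single ME decision.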
