Segmentation of Chinese Long Sentences Using Commas

The comma is the most common punctuation mark, and as such it may have the greatest effect on the syntactic analysis of a sentence. Because Chinese is an isolating language, its sentences offer fewer cues for parsing, and the cues for segmenting a long Chinese sentence are fewer still. However, commas are used more frequently on average in Chinese than in other languages, so the comma plays an important role in segmenting long Chinese sentences. This paper proposes a method that classifies the commas in a Chinese sentence by their context and then segments the long sentence according to the classification results. Experimental results show that comma classification accuracy reaches 87.1 percent, and that with our segmentation model our parser's dependency parsing accuracy improves by 9.6 percent.
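
The two-step idea described above (classify each comma from its context, then cut the sentence at the commas classified as boundaries) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the names `comma_context`, `toy_is_boundary`, and `segment`, the three-token context window, the Penn Chinese Treebank style tags on the input, and above all the toy rule that stands in for the paper's classifier.

```python
from typing import Callable, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); tag set is an assumption
COMMAS = {",", "，"}     # ASCII and full-width comma


def comma_context(tokens: Sequence[Token], i: int,
                  window: int = 3) -> Tuple[List[Token], List[Token]]:
    """Return up to `window` tokens on each side of the comma at index i."""
    left = list(tokens[max(0, i - window):i])
    right = list(tokens[i + 1:i + 1 + window])
    return left, right


def toy_is_boundary(left: Sequence[Token], right: Sequence[Token]) -> bool:
    """Hypothetical stand-in classifier, NOT the paper's model.

    Treats a comma as a segment boundary only when a verb-like tag
    (any tag starting with "V") appears on both sides of it.
    """
    def has_verb(ctx: Sequence[Token]) -> bool:
        return any(tag.startswith("V") for _, tag in ctx)

    return has_verb(left) and has_verb(right)


def segment(tokens: Sequence[Token],
            is_boundary: Callable[[Sequence[Token], Sequence[Token]], bool]
            = toy_is_boundary,
            window: int = 3) -> List[List[Token]]:
    """Split a tagged sentence after every comma classified as a boundary."""
    segments: List[List[Token]] = []
    current: List[Token] = []
    for i, token in enumerate(tokens):
        current.append(token)
        if token[0] in COMMAS:
            left, right = comma_context(tokens, i, window)
            if is_boundary(left, right):
                segments.append(current)
                current = []
    if current:  # keep whatever follows the last boundary
        segments.append(current)
    return segments


two_clauses = [("他", "PN"), ("来", "VV"), ("了", "AS"), ("，", "PU"),
               ("我", "PN"), ("走", "VV"), ("了", "AS")]
noun_list = [("苹果", "NN"), ("，", "PU"), ("香蕉", "NN"),
             ("都", "AD"), ("好吃", "VA")]

print(len(segment(two_clauses)))  # comma separates two clauses
print(len(segment(noun_list)))    # comma separates two nouns
```

Because `segment` accepts any callable as `is_boundary`, a trained classifier could in principle be substituted for the toy rule without changing the segmentation loop; the resulting segments could then be parsed separately.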
