Paper information - CSAT: A Chinese Segmentation and Tagging Module Based on the Interpolated Probabilistic Model


Chinese is a challenging language for natural language processing. Unlike languages such as English or Portuguese, Chinese text must first be segmented, because a Chinese sentence contains no delimiters that mark word boundaries. Chinese processing also faces many ambiguity problems, including segmentation ambiguities, unknown words, and part-of-speech ambiguities. In segmentation and tagging, one of the main tasks is to identify unknown words and recognize proper nouns, and research effort is being devoted to this particular problem. In this paper, an integrated application with segmentation and tagging ability has been studied and implemented. In the segmentation phase, a line of Chinese text is first split into a sequence of atomic characters. For example, a Chinese statement meaning "we go to play ball" is split into its individual characters separated by spaces. Then, for every atomic character in the statement, a knowledge base (the necessary statistics about a segmented and tagged Chinese corpus, the PFR corpus provided at http://www.icl.pku.edu.cn) is searched to find all the words beginning with that character, and the results are kept in an appropriate structure. After that, a depth-first search is performed on the collected data to generate all the possible segmentations of the Chinese statement based on the word bi-gram model. All the candidates are then evaluated and the N-best candidate segmentations are selected. In the second phase, an interpolated probabilistic tagging model proposed in [2], with proper noun recognition and tagging, is applied to tag the N-best candidate segmented Chinese statements. Experiments with different parameters (e.g., the parameter N used for getting the N-best candidate segmentations) were carried out for comparison so as to improve the performance of the application.
This application is the first step in Chinese processing, and it is used as a preprocessing module in the Chinese-to-Portuguese machine translation system.
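The segmentation phase described in the abstract (split into characters, look up the words that start at each character, enumerate candidates by depth-first search, score them with a word bi-gram model, keep the N best) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the toy counts in `UNIGRAM` and `BIGRAM` are invented rather than taken from the PFR corpus, the example sentence is a textbook ambiguity rather than the one used in the paper, add-one smoothing is a stand-in for whatever estimation the authors used, and the function names are hypothetical. The tagging phase is not covered.

```python
import math
from collections import defaultdict

# Toy knowledge base with invented counts (NOT statistics from the PFR corpus).
UNIGRAM = {"研": 1, "究": 1, "生": 1, "命": 1, "起": 1, "源": 1,
           "研究": 5, "研究生": 2, "生命": 4, "起源": 3}
BIGRAM = {("<s>", "研究"): 3, ("<s>", "研究生"): 1,
          ("研究", "生命"): 2, ("生命", "起源"): 2}


def words_starting_at(chars, i, lexicon):
    """Return every lexicon word that begins at position i.

    Falls back to the single character so that characters missing
    from the lexicon never block the search.
    """
    found = []
    for j in range(i + 1, len(chars) + 1):
        word = "".join(chars[i:j])
        if word in lexicon:
            found.append(word)
    return found or [chars[i]]


def all_segmentations(sentence, lexicon):
    """Enumerate every segmentation of the sentence by depth-first search."""
    chars = list(sentence)  # step 1: split into atomic characters
    results = []

    def dfs(i, path):
        if i == len(chars):
            results.append(list(path))
            return
        for word in words_starting_at(chars, i, lexicon):
            path.append(word)
            dfs(i + len(word), path)
            path.pop()

    dfs(0, [])
    return results


def bigram_logprob(words, bigram, vocab_size):
    """Score a word sequence with an add-one smoothed bi-gram model."""
    context = defaultdict(int)
    for (prev, _), count in bigram.items():
        context[prev] += count
    score, prev = 0.0, "<s>"
    for word in words:
        numerator = bigram.get((prev, word), 0) + 1
        denominator = context[prev] + vocab_size
        score += math.log(numerator / denominator)
        prev = word
    return score


def n_best(sentence, n, unigram=UNIGRAM, bigram=BIGRAM):
    """Return the n highest-scoring candidate segmentations."""
    candidates = all_segmentations(sentence, unigram)
    candidates.sort(key=lambda ws: bigram_logprob(ws, bigram, len(unigram)),
                    reverse=True)
    return candidates[:n]


# Two best candidates for a classic ambiguous string.
top_two = n_best("研究生命起源", n=2)
```

Exhaustive enumeration grows exponentially with sentence length, so a practical system would likely prune during the search or use a lattice-based N-shortest-paths formulation; the sketch keeps the brute-force form only because it mirrors the description above most directly.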

Ming Chui Dong | Fai Wong | Ka Seng Leong | Chi Wai Tang

[1] Qiang Zhou, et al. Blending Segmentation With Tagging In Chinese Language Corpus Processing, 1994, COLING.

[2] Zhang Hua-ping. Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method, 2002.

[3] Fai Wong, et al. Interpolated probabilistic tagging model optimized with genetic algorithm, 2004, Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.04EX826).

[4] James H. Martin, et al. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2000.

[5] Xu Dongliang, et al. Integrated Chinese word segmentation and part-of-speech tagging based on the divide-and-conquer strategy, 2003, International Conference on Natural Language Processing and Knowledge Engineering.

[6] Qun Liu, et al. Chinese Lexical Analysis Using Hierarchical Hidden Markov Model, 2003, SIGHAN.