CSAT: A Chinese Segmentation and Tagging Module Based on the Interpolated Probabilistic Model

Chinese is a challenging language in natural language processing. Unlike languages such as English and Portuguese, Chinese text must first be segmented, because a Chinese sentence contains no delimiters that mark word boundaries. Chinese processing also faces many ambiguity problems, such as segmentation ambiguities, unknown words, and part-of-speech ambiguities. In segmentation and tagging, one of the main tasks is to identify unknown words and recognize proper nouns, and research effort has been devoted to this particular problem. In this paper, an integrated application with segmentation and tagging ability has been studied and implemented. In the segmentation phase, a line of Chinese text is first split into a sequence of atomic characters. For example, a Chinese statement meaning "we go to play ball" is split into its individual characters, separated by spaces. Then, for every atomic character in the statement, a knowledge base (the necessary statistics about a segmented and tagged Chinese corpus, the PFR corpus provided at http://www.icl.pku.edu.cn) is searched to find all the words beginning with that character, and the results are kept in an appropriate structure. After that, a depth-first search is performed on the collected data to generate all the possible segmentations of the statement based on the word bigram model. All the candidates are then evaluated, and the N-best candidate segmentations are selected. In the second phase, an interpolated probabilistic tagging model proposed in [2], extended with proper noun recognition and tagging, is applied to tag the N-best candidate segmentations. Experiments with different parameters (e.g., the parameter N used for selecting the N-best candidate segmentations) were carried out for comparison so as to improve the performance of the application.
This application is the first step in Chinese processing, and it is used as a preprocessing module in a Chinese-to-Portuguese machine translation system.
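The segmentation phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy counts, and the add-one smoothing are all assumptions made for illustration, and Latin letters stand in for Chinese characters. The sketch looks up the dictionary words starting at each character, enumerates every segmentation by depth-first search, scores each candidate with a word bigram model, and keeps the N best.

```python
import math

# Hypothetical toy knowledge base (not taken from the PFR corpus):
# word counts and word-bigram counts, with "<s>" as the sentence start.
UNIGRAMS = {"<s>": 10, "A": 2, "B": 2, "AB": 6, "C": 8, "BC": 1}
BIGRAMS = {
    ("<s>", "AB"): 6, ("AB", "C"): 5,
    ("<s>", "A"): 2, ("A", "B"): 1, ("B", "C"): 1, ("A", "BC"): 1,
}


def words_starting_at(chars, lexicon, max_len):
    """For each position, list the lexicon words that begin there."""
    table = []
    for i in range(len(chars)):
        end = min(len(chars), i + max_len)
        found = ["".join(chars[i:j]) for j in range(i + 1, end + 1)
                 if "".join(chars[i:j]) in lexicon]
        # Fall back to the single character so that unknown characters
        # never block the search.
        table.append(found or [chars[i]])
    return table


def all_segmentations(chars, table):
    """Depth-first enumeration of every segmentation of the sentence."""
    results = []

    def dfs(pos, path):
        if pos == len(chars):
            results.append(list(path))
            return
        for word in table[pos]:
            path.append(word)
            dfs(pos + len(word), path)
            path.pop()

    dfs(0, [])
    return results


def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Log-probability under a bigram model with add-one smoothing
    (the smoothing scheme is an assumption of this sketch)."""
    score, prev = 0.0, "<s>"
    for word in words:
        numerator = bigrams.get((prev, word), 0) + 1
        denominator = unigrams.get(prev, 0) + vocab_size
        score += math.log(numerator / denominator)
        prev = word
    return score


def n_best(text, unigrams, bigrams, n):
    """Return the n highest-scoring segmentations of text."""
    chars = list(text)
    lexicon = set(unigrams) - {"<s>"}
    max_len = max(len(word) for word in lexicon)
    table = words_starting_at(chars, lexicon, max_len)
    candidates = all_segmentations(chars, table)
    candidates.sort(
        key=lambda ws: bigram_log_prob(ws, unigrams, bigrams, len(lexicon)),
        reverse=True,
    )
    return candidates[:n]


print(n_best("ABC", UNIGRAMS, BIGRAMS, 2))
```

With these toy counts the three-character input has three possible segmentations, and the call keeps the two with the highest bigram score. Exhaustive depth-first enumeration grows exponentially with sentence length, so a sketch like this is only practical for short inputs.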