Classifying Commas for Patent Machine Translation

Commas are widely used in Chinese and play an important role in detecting the boundaries of basic units in sentences and discourse. For Chinese-English patent machine translation, this paper presents two methods that use rich linguistic information to distinguish commas that separate sub-sentences from commas that do not. The first method employs a word knowledge base and formal rules to determine the roles of commas, while the second uses machine learning approaches. Experimental results show that the overall F1 scores of the rule-based method are higher than 93%, indicating that the approach performs well in classifying commas. The machine learning classifiers, on the other hand, differ somewhat in their results. We also conclude that identifying commas can improve the quality of translation output.
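As an illustration of the F1 measure reported above, the following minimal sketch computes precision, recall, and F1 per comma class from gold and predicted labels. The label names `SUB` (comma separates sub-sentences) and `NON` (comma does not), the function name `f1_per_class`, and the toy data are assumptions made for this example; they are not taken from the paper's data or evaluation scripts.

```python
from collections import Counter


def f1_per_class(gold, pred):
    """Return {label: (precision, recall, f1)} for every label in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    scores = {}
    for label in set(gold) | set(pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical toy labels: one entry per comma in a small test set.
gold = ["SUB", "SUB", "NON", "NON", "SUB", "NON"]
pred = ["SUB", "NON", "NON", "NON", "SUB", "SUB"]

for label, (p, r, f) in sorted(f1_per_class(gold, pred).items()):
    print(f"{label}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```

An overall score can then be obtained by averaging the per-class F1 values or by pooling the counts across classes; which of these the reported figure uses is not stated here.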