The Discovery of Natural Typing Annotations: User-produced Potential Chinese Word Delimiters

Human-labeled corpora are indispensable for training supervised word segmenters. However, labeling a corpus manually is time-consuming and labor-intensive. When typing Chinese text with a Pinyin input method, people usually need to press the space bar or a numeric key to choose among homophonous candidate words, and these keystrokes can be viewed as cues for segmentation. We argue that such a process can be used to build a labeled corpus in a more natural way. Thus, in this paper, we investigate Natural Typing Annotations (NTAs), which are potential word delimiters produced by users while typing Chinese. A detailed analysis of over three hundred user-produced texts containing NTAs reveals that high-quality NTAs mostly agree with gold segmentation and, consequently, can be used to improve the performance of a supervised word segmentation model on out-of-domain text. Experiments show that a classification model combined with a voting mechanism can reliably identify high-quality NTA texts, which serve as a more readily available labeled corpus. Furthermore, NTAs might be particularly useful for dealing with out-of-vocabulary (OOV) words such as proper names and neologisms.
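To make the idea of NTAs agreeing with gold segmentation concrete, the following is a minimal sketch, not the paper's actual procedure. It assumes a hypothetical typing log in which each list element is the run of characters committed by one space or numeric-key press, and it measures the fraction of the resulting delimiters that coincide with gold word boundaries. The function names `boundaries` and `nta_agreement` and the sample sentence are illustrative assumptions.

```python
from itertools import accumulate


def boundaries(chunks):
    """Return the internal character offsets at which one chunk ends
    and the next begins (the end of the text is not counted)."""
    offsets = list(accumulate(len(c) for c in chunks))
    return set(offsets[:-1])


def nta_agreement(committed_chunks, gold_words):
    """Fraction of NTA delimiters that are also gold word boundaries.

    committed_chunks: hypothetical log, one string per commit keystroke.
    gold_words: gold segmentation of the same text.
    """
    if "".join(committed_chunks) != "".join(gold_words):
        raise ValueError("both segmentations must cover the same text")
    nta = boundaries(committed_chunks)
    gold = boundaries(gold_words)
    if not nta:
        return 1.0  # no delimiters typed, so nothing can disagree
    return len(nta & gold) / len(nta)


# Hypothetical example: the user committed a four-character phrase at once,
# so the NTAs are coarser than the gold words but never contradict them.
committed = ["我们", "喜欢", "自然语言", "处理"]
gold = ["我们", "喜欢", "自然", "语言", "处理"]
agreement = nta_agreement(committed, gold)
```

In this sketch a text whose delimiters all fall on gold boundaries scores 1.0 even when some gold boundaries are missing, which reflects the view of NTAs as partial annotations rather than complete segmentations.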
