Human-Computer Interactive Chinese Word Segmentation: An Adaptive Dirichlet Process Mixture Model Approach

Previous research shows that Kalman-filter-based human-computer interactive Chinese word segmentation reduces the number of user interventions required, but cannot reliably distinguish segmentation ambiguities. This paper proposes a novel approach to this problem using an adaptive Dirichlet process mixture model. By adjusting the hyperparameters of the model, classifiers can be generated that conform to the interventions provided by users. Experiments show that our approach notably improves the handling of segmentation ambiguities. With knowledge learned from users, our model outperforms the baseline Kalman filter model by about 0.5% in segmenting homogeneous texts.
