Unknown Chinese word extraction based on variety of overlapping strings

Not all languages have delimiters between words; Chinese is one example. To extract words from a sentence in such a language, we usually rely on a dictionary of known words. For unknown words, some approaches rely on a domain-specific dictionary or a tailor-made training data set, but this information may not be available. Another direction is to use unsupervised methods, which rely on a goodness measure that estimates, from statistics of the given text, how likely a candidate is to be a meaningful word. The most challenging issue is identifying low-frequency meaningful words. In this paper, we first show by an empirical study on Chinese texts that none of the classical goodness measures separates low-frequency meaningful words from meaningless ones effectively. To solve this problem, we propose a new goodness measure, the overlap variety method. The key idea behind the new measure is not to consider the absolute number of occurrences of the candidate (i.e., a string of Chinese characters) but to compare the goodness measure of the candidate (we use the accessor variety) with those of the strings overlapping the candidate. The candidate is likely to be meaningful if its accessor variety is larger than the accessor varieties of the overlapping strings. We implement UNExtract, an extraction system for unknown Chinese words based on this overlap variety method. We evaluate our approach on the CIPS-SIGHAN-2010 bakeoff corpora and show that the proposed measure is more effective than five other state-of-the-art goodness measures (accessor variety, branch entropy, description length gain, frequency substring reduction, pointwise mutual information), especially for low-frequency words and bi-gram words.
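The comparison described above can be illustrated with a small Python sketch. It is not the UNExtract implementation: the abstract does not give the exact definitions, so the code makes several assumptions. Accessor variety is taken to be the smaller of the number of distinct left neighbours and the number of distinct right neighbours of a string; the "overlapping strings" are assumed to be the same-length strings obtained by shifting each occurrence of the candidate one character to the left or right; and punctuation is treated like any other character. The function names (`accessor_variety`, `overlapping_strings`, `passes_overlap_check`) and the toy text are hypothetical.

```python
def occurrences(text, s):
    """Yield the start index of every (possibly overlapping) occurrence of s."""
    start = text.find(s)
    while start != -1:
        yield start
        start = text.find(s, start + 1)


def accessor_variety(text, s):
    """Assumed definition: min(distinct left neighbours, distinct right neighbours)."""
    left, right = set(), set()
    for i in occurrences(text, s):
        j = i + len(s)
        # A text boundary counts as its own distinct neighbour (simplification).
        left.add(text[i - 1] if i > 0 else ("<start>", i))
        right.add(text[j] if j < len(text) else ("<end>", i))
    return min(len(left), len(right))


def overlapping_strings(text, s):
    """Assumed definition: same-length strings shifted one character
    left or right from each occurrence of s."""
    n = len(s)
    found = set()
    for i in occurrences(text, s):
        if i > 0:
            found.add(text[i - 1:i - 1 + n])
        if i + 1 + n <= len(text):
            found.add(text[i + 1:i + 1 + n])
    found.discard(s)
    return found


def passes_overlap_check(text, s):
    """True if s occurs and its accessor variety exceeds that of every
    overlapping string (the comparison sketched in the abstract)."""
    av = accessor_variety(text, s)
    if av == 0:  # s does not occur in the text
        return False
    return all(av > accessor_variety(text, r) for r in overlapping_strings(text, s))


# Toy text (hypothetical): "喜欢" appears in three different contexts.
sample = "我喜欢苹果。他喜欢香蕉。她喜欢音乐。"
for candidate in ("喜欢", "欢苹"):
    print(candidate, accessor_variety(sample, candidate),
          passes_overlap_check(sample, candidate))
```

In this toy text, "喜欢" occurs only three times, yet it appears with three different characters on each side, while every string that overlaps it appears in a single context. The check therefore depends on the relative variety of contexts rather than on the raw frequency, which is the property the abstract emphasises for low-frequency words.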
