A Simple and Effective Unsupervised Word Segmentation Approach

In this paper, we propose a new unsupervised approach for word segmentation. The core idea of our approach is a novel word induction criterion called WordRank, which estimates the goodness of word hypotheses (character or phoneme sequences). We devise a method to derive exterior word boundary information from the link structures of adjacent word hypotheses and incorporate interior word boundary information to complete the model. In light of WordRank, word segmentation can be modeled as an optimization problem. A Viterbistyled algorithm is developed for the search of the optimal segmentation. Extensive experiments conducted on phonetic transcripts as well as standard Chinese and Japanese data sets demonstrate the effectiveness of our approach. On the standard Brent version of Bernstein-Ratner corpora, our approach outperforms the state-of-the-art Bayesian models by more than 3%. Plus, our approach is simpler and more efficient than the Bayesian methods. Consequently, our approach is more suitable for real-world applications.

[1]  Anand Venkataraman,et al.  A Statistical Model for Word Discovery in Transcribed Speech , 2001, CL.

[2]  Naonori Ueda,et al.  Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling , 2009, ACL.

[3]  Le Zhang,et al.  Statistical Substring Reduction in Linear Time , 2004, IJCNLP.

[4]  Makoto Nagao,et al.  Building a Japanese parsed corpus while improving the parsing system , 1997 .

[5]  Xiaotie Deng,et al.  Unsupervised Segmentation of Chinese Corpus Using Accessor Variety , 2004, IJCNLP.

[6]  Kumiko Tanaka-Ishii,et al.  Unsupervised Segmentation of Chinese Text by Use of Branching Entropy , 2006, ACL.

[7]  Mark Johnson,et al.  Using Adaptor Grammars to Identify Synergies in the Unsupervised Acquisition of Linguistic Structure , 2008, ACL.

[8]  Margaret M. Fleck Lexicalized Phonotactic Word Segmentation , 2008, ACL.

[9]  Thomas Emerson,et al.  The Second International Chinese Word Segmentation Bakeoff , 2005, IJCNLP.

[10]  Yorick Wilks,et al.  Unsupervised Learning of Word Boundary with Description Length Gain , 1999, CoNLL.

[11]  Thomas L. Griffiths,et al.  Contextual Dependencies in Unsupervised Word Segmentation , 2006, ACL.

[12]  Michael R. Brent,et al.  An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery , 1999, Machine Learning.

[13]  Lillian Lee,et al.  Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji , 2000, ANLP.

[14]  Maosong Sun,et al.  Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data , 1998, ACL.

[15]  T. Griffiths,et al.  A Bayesian framework for word segmentation: Exploring the effects of context , 2009, Cognition.

[16]  Hiroya Takamura,et al.  An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL , 2010, EMNLP.

[17]  Chengqing Zong,et al.  A Character-Based Joint Model for Chinese Word Segmentation , 2010, COLING.

[18]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.