An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

This paper proposes a fast and simple unsupervised word segmentation algorithm that exploits the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy to constrain the hypothesis space, so that a solution minimizing the length of a two-part MDL code can be found efficiently. An evaluation on corpora in Japanese, Thai, and English, and on the CHILDES corpus used in language development research, shows that the algorithm achieves accuracy comparable to that of state-of-the-art unsupervised word segmentation methods in significantly less computation time.
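
The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the fixed context length `order`, and the particular lexicon coding (spelling each word type character by character with an end marker) are assumptions made only for illustration. Branching entropy is taken here as the entropy of the next character given the preceding `order` characters, and the two-part code length as the bits needed for the lexicon plus the bits needed for the corpus given that lexicon.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, order=2):
    """Map each context of `order` characters to the entropy (in bits)
    of the character that follows it in `text`."""
    followers = defaultdict(Counter)
    for i in range(len(text) - order):
        followers[text[i:i + order]][text[i + order]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropy


def description_length(words):
    """Two-part code length in bits for a segmented corpus (list of words).

    Assumed coding scheme, for illustration only:
      - corpus cost: each token coded with its unigram word probability
      - lexicon cost: each word type spelled out with an end marker,
        coded with the character distribution of the lexicon itself
    """
    word_counts = Counter(words)
    total = len(words)
    data_bits = -sum(n * math.log2(n / total) for n in word_counts.values())

    char_counts = Counter(ch for w in word_counts for ch in w + "\0")
    n_chars = sum(char_counts.values())
    lexicon_bits = -sum(
        n * math.log2(n / n_chars) for n in char_counts.values()
    )
    return lexicon_bits + data_bits


if __name__ == "__main__":
    # A high-entropy context suggests a likely word boundary after it.
    print(branching_entropy("abac", order=1))
    # Compare two candidate segmentations of the same string.
    print(description_length(["ab", "ab", "ab", "ab"]))
    print(description_length(list("abababab")))
```

Under this assumed coding, the segmentation into repeated `ab` units yields a shorter total code than the character-by-character one, which is the sense in which MDL prefers a compact lexicon that still describes the data cheaply. In a segmenter of the kind the abstract describes, positions with high branching entropy would serve as boundary candidates, and the description length would be used to choose among the resulting hypotheses.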
