Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars

One reason nonparametric Bayesian inference is attracting attention in computational linguistics is that it provides a principled way of learning the units of generalization together with their probabilities. Adaptor grammars are a framework for defining a variety of hierarchical nonparametric Bayesian models. This paper investigates some of the choices that arise in formulating adaptor grammars and their associated inference procedures, and shows that these choices can have a dramatic impact on performance in an unsupervised word segmentation task. With appropriate adaptor grammars and inference procedures, we achieve an 87% word token f-score on the standard Brent version of the Bernstein-Ratner corpus, an error reduction of more than 35% relative to the best previously reported results for this corpus.
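Adaptor grammars attach a Pitman-Yor process "adaptor" to selected nonterminals, caching whole subtrees (for word segmentation, whole words) so that frequently reused units accumulate probability mass of their own. The sketch below, which is not taken from the paper, illustrates the Chinese-restaurant-process view of a single Pitman-Yor adaptor; the helper base_draw (a stand-in for the grammar's base distribution over words) and the parameter values are hypothetical.

    import random

    def pyp_crp_sample(tables, base_draw, a=0.5, b=1.0):
        # One draw from a Pitman-Yor process via its Chinese restaurant
        # representation.  `tables` is a mutable list of [label, count]
        # pairs, one per occupied table; `base_draw()` samples a fresh
        # label from the base distribution.  a (discount) and b
        # (concentration) are the two Pitman-Yor hyperparameters.
        n = sum(count for _, count in tables)   # customers seated so far
        r = random.uniform(0.0, n + b)
        for table in tables:
            r -= table[1] - a                   # existing table: weight count - a
            if r <= 0.0:
                table[1] += 1
                return table[0]
        label = base_draw()                     # new table: total weight b + a*len(tables)
        tables.append([label, 1])
        return label

    # Toy usage: a uniform base distribution over three "words";
    # repeated draws concentrate on a few cached items.
    tables = []
    draws = [pyp_crp_sample(tables, lambda: random.choice(["a", "b", "c"]))
             for _ in range(20)]

Repeated draws from such a process reuse cached units with the power-law behavior characteristic of word frequencies, which is what makes Pitman-Yor adaptors a natural fit for learning a lexicon.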
