Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models

We propose a nonparametric Bayesian model for joint unsupervised word segmentation and part-of-speech tagging from raw strings. The model extends a previous model for word segmentation and is called a Pitman-Yor Hidden Semi-Markov Model (PYHSMM). It can be viewed as a method for building a class n-gram language model directly from strings while integrating character-level and word-level information. Experiments on standard datasets in Japanese, Chinese, and Thai show that it outperforms previous results, yielding state-of-the-art accuracies. The model can also serve to analyze the structure of a language whose words are not identified a priori.
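To make the idea of joint segmentation and tagging concrete, the following minimal sketch runs a semi-Markov Viterbi search over an unsegmented string, scoring each candidate word together with a tag and a tag-to-tag transition (a class bigram). It is an illustration only, not the PYHSMM itself: the scoring functions `word_logp` and `trans_logp`, the toy lexicon, the tag names, and the `max_len` limit are all assumptions made here, and the fixed toy scores stand in for the Pitman-Yor language models that the actual model learns without supervision.

```python
import math

def viterbi_segment_tag(s, tags, word_logp, trans_logp, max_len=4):
    """Best joint segmentation and tagging of s under a toy class-bigram model.

    word_logp(word, tag) and trans_logp(prev_tag, tag) are hypothetical
    scoring callbacks; prev_tag is None at the start of the string.
    """
    n = len(s)
    # best[t][z] = (score, backpointer) for the best path whose last word
    # ends at position t and carries tag z.
    best = [dict() for _ in range(n + 1)]
    best[0][None] = (0.0, None)
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):  # k = length of the last word
            word = s[t - k:t]
            for z in tags:
                emit = word_logp(word, z)
                for prev_z, (prev_score, _) in best[t - k].items():
                    score = prev_score + trans_logp(prev_z, z) + emit
                    if z not in best[t] or score > best[t][z][0]:
                        best[t][z] = (score, (t - k, prev_z))
    # Trace back from the best final tag to recover (word, tag) pairs.
    t, z = n, max(best[n], key=lambda tag: best[n][tag][0])
    out = []
    while t > 0:
        prev_t, prev_z = best[t][z][1]
        out.append((s[prev_t:t], z))
        t, z = prev_t, prev_z
    return out[::-1]

# Toy scores (assumed for illustration): known (word, tag) pairs are cheap,
# everything else pays a per-character penalty; transitions are uniform.
LEXICON = {("the", "D"): 0.5, ("dog", "N"): 0.5, ("runs", "V"): 0.5}

def word_logp(word, tag):
    p = LEXICON.get((word, tag))
    return math.log(p) if p is not None else -10.0 * len(word)

def trans_logp(prev_tag, tag):
    return 0.0

result = viterbi_segment_tag("thedogruns", ["D", "N", "V"], word_logp, trans_logp)
print(result)
```

The semi-Markov aspect is the inner loop over `k`: each state spans a whole word of variable length rather than a single character, which is what lets word boundaries and tags be chosen together.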
