Extended word similarity based clustering on unsupervised PoS induction to improve English-Indonesian statistical machine translation

In this paper, we present an unsupervised Part-of-Speech (PoS) induction algorithm for improving translation quality in statistical machine translation. The proposed algorithm extends Word-Similarity-Based (WSB) clustering, in which the similarity between two words is measured by their grammatical relations with other words. Each grammatical relation is represented as an n-gram relation. We extend WSB clustering by also taking the preceding words into account when measuring the grammatical relation. The clustering results are then used in English-Indonesian statistical machine translation. The experiments were conducted using MOSES as the machine translation decoder and were evaluated by BLEU score. Using 14,000 English-Indonesian sentence pairs, the clustering improved the BLEU score by 2.07%.
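To illustrate the idea of measuring word similarity through n-gram relations, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes that each word is described by counts of its bigram neighbours, that the baseline relation looks only at the following word, and that the extension adds the preceding word. Cosine similarity, the toy corpus, and the names `context_vectors`, `cosine`, and `use_previous` are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, use_previous=True):
    """Count bigram neighbours of every word (hypothetical helper).

    Each feature is a (direction, neighbour) pair. With use_previous=False
    only the following word is counted, which is the assumed baseline.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers so that every word has two neighbours.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            word = tokens[i]
            vectors[word][("next", tokens[i + 1])] += 1
            if use_previous:
                vectors[word][("prev", tokens[i - 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[feature] for feature, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (assumed for illustration only).
corpus = ["the cat sleeps", "the dog sleeps", "a cat eats", "a dog eats"]
vectors = context_vectors(corpus)

# Words that share left and right neighbours score higher than words that do not.
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_the = cosine(vectors["cat"], vectors["the"])
```

In a clustering step, pairwise scores of this kind would decide which words are grouped into the same induced class. The resulting class labels could then be supplied to the translation system as additional word-level information.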
