An Efficient Method for Determining Bilingual Word Classes

Statistical natural language processing constantly faces the problem of sparse data. One way to mitigate it is to group words into equivalence classes, a standard method in statistical language modeling. In this paper we describe a method for determining bilingual word classes suitable for statistical machine translation. We develop an optimization criterion based on a maximum-likelihood approach and describe a clustering algorithm. We show that using the resulting bilingual word classes can improve statistical machine translation.
