Improving Word Alignment Using Linguistic Code Switching Data

Linguistic Code Switching (LCS) occurs when two or more languages appear within a single conversation. For example, in English-Chinese code switching, a speaker might produce a mostly Chinese sentence that contains the English word "meeting" (meaning "We will have a meeting in 15 minutes"). Traditional machine translation (MT) systems either treat LCS data as noise or handle it like any other sentence. However, if LCS data is processed intelligently, it can provide a useful signal for training word alignment and MT models. Moreover, LCS data comes from non-news sources, so it can increase the diversity of MT training data. In this paper, we first extract constraints from code switching data and then incorporate them into the training procedure of a word alignment model. We also show that code switching data makes it possible to jointly train a word alignment model and a language model using co-training. Our techniques for incorporating LCS data improve BLEU by 2.64 points over a baseline MT system trained using only standard sentence-aligned corpora.
