Improving Word Alignment Using Linguistic Code Switching Data

Linguistic Code Switching (LCS) is a situation where two or more languages show up in the context of a single conversation. For example, in English-Chinese code switching, there might be a Chinese sentence that embeds the English word "meeting" (meaning "We will have a meeting in 15 minutes"). Traditional machine translation (MT) systems treat LCS data as noise, or just as regular sentences. However, if LCS data is processed intelligently, it can provide a useful signal for training word alignment and MT models. Moreover, LCS data comes from non-news sources, which can enhance the diversity of training data for MT. In this paper, we first extract constraints from this code switching data and then incorporate them into a word alignment model training procedure. We also show that by using the code switching data, we can jointly train a word alignment model and a language model using co-training. Our techniques for incorporating LCS data improve BLEU score by 2.64 over a baseline MT system trained using only standard sentence-aligned corpora.

Alexander Yates | Fei Huang
