Mixed Language and Code-Switching in the Canadian Hansard

While there has been considerable interest in code-switching in informal text such as tweets and online content, we ask whether code-switching also occurs in the proceedings of multilingual institutions. We focus on the Canadian Hansard and automatically detect mixed-language segments using simple corpus-based rules and an existing word-level language tagger. Manual evaluation shows that the performance of automatic detection varies significantly depending on the primary language. While 95% precision can be achieved when the original language is French, common words generate many false positives that hurt precision when the original language is English. Furthermore, we find that code-switching does occur within the mixed-language examples detected in the Canadian Hansard, and that it might be used differently by French and English speakers. This analysis suggests that parallel corpora such as the Hansard can provide interesting test beds for studying multilingual practices, including code-switching and its translation, and encourages us to collect more gold annotations to improve the characterization and detection of mixed language and code-switching in parallel corpora.
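As a rough illustration of the kind of pipeline described above, the following minimal sketch flags a segment as mixed-language when a word-level tagger assigns enough of its words to a language other than the segment's primary language, and then computes precision against manual judgments. Everything in it is an assumption made for illustration: the toy lexicons stand in for a real word-level language tagger, and the names `tag_word`, `is_mixed`, `AMBIGUOUS`, and the `min_foreign` threshold are hypothetical rather than the rules used in the paper.

```python
from typing import List

# Hypothetical toy lexicons standing in for a real word-level language tagger.
FRENCH_WORDS = {"le", "la", "monsieur", "merci", "beaucoup", "je", "suis"}
ENGLISH_WORDS = {"the", "speaker", "thank", "you", "very", "much", "mr"}

# Hypothetical filter: short, frequent words that a tagger could plausibly
# assign to either language, and that would otherwise cause false positives.
AMBIGUOUS = {"a", "on", "me", "non", "no"}


def tag_word(word: str) -> str:
    """Return 'fr', 'en', or 'unk' for a single token (toy lookup)."""
    w = word.lower().strip(".,;:!?")
    if w in AMBIGUOUS:
        return "unk"
    if w in FRENCH_WORDS:
        return "fr"
    if w in ENGLISH_WORDS:
        return "en"
    return "unk"


def is_mixed(segment: str, primary: str, min_foreign: int = 2) -> bool:
    """Flag a segment when at least `min_foreign` words are tagged with a
    language other than the primary one (unknown words are ignored)."""
    tags = [tag_word(w) for w in segment.split()]
    foreign = sum(1 for t in tags if t not in (primary, "unk"))
    return foreign >= min_foreign


def precision(predicted: List[bool], gold: List[bool]) -> float:
    """Fraction of flagged segments that annotators also judged as mixed."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    return tp / (tp + fp) if (tp + fp) else 0.0
```

For example, `is_mixed("Mr Speaker, merci beaucoup", "en")` flags the segment because two words are tagged as French, while `is_mixed("Thank you very much", "en")` does not. Requiring more than one foreign-tagged word and ignoring ambiguous words are simple ways to trade recall for precision; the right settings would depend on the primary language.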
