The CMU Submission for the Shared Task on Language Identification in Code-Switched Data

We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish‐English, Mandarin‐English, Nepali‐English, and Modern Standard Arabic‐Arabic dialects. After describing our CRF-based baseline system, we discuss three extensions for learning from unlabeled data: semi-supervised learning, word embeddings, and word lists.
