The IUCL+ System: Word-Level Language Identification via Extended Markov Models

We describe the IUCL+ system for the shared task of the First Workshop on Computational Approaches to Code Switching (Solorio et al., 2014), in which participants were challenged to label each word in Twitter texts as a named entity or one of two candidate languages. Our system combines character n-gram probabilities, lexical probabilities, word label transition probabilities and existing named entity recognitiontools within a Markovmodel framework that weights these components and assigns a label. Our approach is language-independent, and we submitted results for all data sets (five test sets and three “surprise” sets, covering four language pairs), earning the highest accuracy score on the tweet level on two language pairs (Mandarin-English, Arabicdialects 1 & 2) and one of the surprise sets (Arabic-dialects).