Incremental N-gram Approach for Language Identification in Code-Switched Text

A multilingual person writing a sentence or a piece of text tends to switch between the languages he or she is proficient in. This alternation between languages, commonly known as code-switching, poses the problem of determining the correct language of each word in the text. My method uses a variety of techniques based on observed differences in how words are formed in these languages. My system placed third at both the tweet level and the token level on the main test dataset, and first in the token-level evaluation on the surprise dataset; both datasets consist of Nepali-English code-switched text.

Prajwol Shrestha
