Recurrent-Neural-Network for Language Detection on Twitter Code-Switching Corpus

Mixed-language data is a difficult yet underexplored domain of natural language processing. Most research in fields such as machine translation or sentiment analysis assumes monolingual input. However, people who can use more than one language often communicate in several languages at once. Sociolinguists believe this "code-switching" phenomenon to be socially motivated, for example to express solidarity or to establish authority. Most past work depends on external tools or resources, such as part-of-speech taggers, dictionary look-up, or named-entity recognizers, to extract rich features for training machine learning models. In this paper, we train recurrent neural networks on raw features only, and use word embeddings to learn meaningful representations automatically. On the same mixed-language Twitter corpus, our system outperforms the best SVM-based systems reported in the EMNLP'14 Code-Switching Workshop by 1% in accuracy, which corresponds to a 17% reduction in error rate.
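The approach summarized above, an embedding look-up feeding a recurrent layer that emits one language label per word, can be illustrated with a minimal sketch. It assumes an Elman-style recurrence with a tanh activation and a softmax output; the names `init_params` and `tag_sequence`, the layer sizes, and the toy vocabulary are hypothetical and do not come from the paper. The weights are random and untrained, so the sketch shows only the forward pass, not the reported system.

```python
import numpy as np

def init_params(vocab_size, embed_dim, hidden_dim, num_labels, seed=0):
    """Randomly initialise the parameters (hypothetical sizes, untrained)."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "E": rng.normal(0, scale, (vocab_size, embed_dim)),      # word embeddings
        "W_xh": rng.normal(0, scale, (embed_dim, hidden_dim)),   # input -> hidden
        "W_hh": rng.normal(0, scale, (hidden_dim, hidden_dim)),  # hidden -> hidden
        "b_h": np.zeros(hidden_dim),
        "W_hy": rng.normal(0, scale, (hidden_dim, num_labels)),  # hidden -> labels
        "b_y": np.zeros(num_labels),
    }

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tag_sequence(params, token_ids):
    """Return one label distribution per token (Elman-style forward pass)."""
    h = np.zeros(params["W_hh"].shape[0])  # initial hidden state
    probs = []
    for t in token_ids:
        x = params["E"][t]  # raw feature: the word's embedding only
        h = np.tanh(x @ params["W_xh"] + h @ params["W_hh"] + params["b_h"])
        probs.append(softmax(h @ params["W_hy"] + params["b_y"]))
    return np.array(probs)

# Toy usage: a five-word mixed-language tweet and two labels (assumed).
vocab = {"i": 0, "love": 1, "mi": 2, "familia": 3, "mucho": 4}
labels = ["lang1", "lang2"]
params = init_params(len(vocab), embed_dim=8, hidden_dim=16, num_labels=len(labels))
token_ids = [vocab[w] for w in ["i", "love", "mi", "familia", "mucho"]]
probs = tag_sequence(params, token_ids)
predicted = [labels[i] for i in probs.argmax(axis=1)]
```

Because the hidden state is carried from word to word, each prediction can depend on the preceding context, which is the property that lets such a model work without hand-built features such as dictionary look-ups.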
