Detecting de minimis Code-Switching in Historical German Books

Code-switching has long interested linguists, and computational work has focused in particular on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching with its appearance in more formal registers by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We then quantitatively analyze how code-switching patterns in these books differ from those in the more commonly studied speech and social media corpora. Finally, we address the practical task of predicting code-switching in the DTA corpus from features of the matrix language alone. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus containing rare embedded languages.

[1] Miguel A. Alonso, et al. Sentiment Analysis on Monolingual, Multilingual and Code-Switching Twitter Corpora, 2015, WASSA@EMNLP.

[2] Yang Liu, et al. Part-of-Speech Tagging for English-Spanish Code-Switched Text, 2008, EMNLP.

[3] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[4] Jatin Sharma, et al. POS Tagging of English-Hindi Code-Mixed Social Media Content, 2014, EMNLP.

[5] Thamar Solorio, et al. LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation, 2020, LREC.

[6] Barbara E. Bullock, et al. Metrics for Modeling Code-Switching Across Corpora, 2017, INTERSPEECH.

[7] Tan Lee, et al. Detection of language boundary in code-switching utterances by bi-phone probabilities, 2004, 2004 International Symposium on Chinese Spoken Language Processing.

[8] Sarah Schulz, et al. Code-Switching Ubique Est - Language Identification and Part-of-Speech Tagging for Historical Mixed Text, 2016, LaTeCH@ACL.

[9] Alan W. Black, et al. A Survey of Code-switched Speech and Language Processing, 2019, ArXiv.

[10] Sara Tonelli, et al. A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing, 2017, CLiC-it.

[11] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[12] Kwang-Il Goh, et al. Burstiness and memory in complex systems, 2006.

[13] Julia Hirschberg, et al. Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task, 2018, CodeSwitch@ACL.

[14] Penelope Gardner-Chloros, et al. The LIDES coding manual: a document for preparing and analyzing language interaction data version, 2000.

[15] Chad Nilep. "Code Switching" in Sociocultural Linguistics, 2006.

[16] Josef van Genabith, et al. Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques, 2018, CodeSwitch@ACL.

[17] Somnath Banerjee, et al. Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval, 2015, FIRE Workshops.

[18] Julia Hirschberg, et al. Overview for the First Shared Task on Language Identification in Code-Switched Data, 2014, CodeSwitch@EMNLP.

[19] Haizhou Li, et al. Integration of language identification into a recognition system for spoken conversations containing code-Switches, 2012, SLTU.

[20] Vinay Singh, et al. Named Entity Recognition for Hindi-English Code-Mixed Social Media Text, 2018, NEWS@ACL.

[21] Barbara E. Bullock, et al. The limits of Spanglish?, 2019, LaTeCH@NAACL-HLT.