A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing

Code-mixing is the alternation between two or more languages in the same text. This phenomenon is highly relevant in the travel domain, since it can provide new insight into the way foreign cultures are perceived and described to readers. In this paper, we analyse English-Italian code-mixing in historical English travel writing about Italy. We retrain and compare two existing systems for the automatic detection of code-mixing, and analyse the semantic categories most frequently associated with the Italian words and expressions. We also release the domain corpus used in our experiments and the output of the extraction.
