Code-Switching Ubique Est - Language Identification and Part-of-Speech Tagging for Historical Mixed Text

In this paper, we describe the development of a language identification system and a part-of-speech tagger for Latin-Middle English mixed text. To this end, we annotate data with language IDs and Universal POS tags (Petrov et al., 2012). For both sub-tasks, we train a conditional random field classifier whose features include those generated by the TreeTagger models of both languages. We evaluate the systems both in general and with respect to a specific task. Moreover, we describe our effort to move beyond proof-of-concept implementations of tools towards a more task-oriented approach, showing how our techniques can be applied in the context of Humanities research.
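
The feature setup described above can be illustrated with a minimal sketch. It assumes that each token has already been tagged once by a Latin model and once by a Middle English model, and that both tag sequences are passed in as plain lists. The function names, the feature names, and the toy tags below are hypothetical choices made for illustration; they are not the feature set used in the paper. The resulting per-token dictionaries follow the general shape that many CRF toolkits accept as input.

```python
from typing import Dict, List, Union

Feature = Union[str, bool, float]


def token_features(tokens: List[str], la_tags: List[str],
                   enm_tags: List[str], i: int) -> Dict[str, Feature]:
    """Build a feature dict for token i (hypothetical feature set)."""
    word = tokens[i]
    lower = word.lower()
    feats: Dict[str, Feature] = {
        "bias": 1.0,
        "lower": lower,
        "suffix3": lower[-3:],   # last three characters (or fewer)
        "prefix2": lower[:2],    # first two characters (or fewer)
        # Tags proposed by the two monolingual taggers (assumed inputs).
        "la_tag": la_tags[i],
        "enm_tag": enm_tags[i],
        "taggers_agree": la_tags[i] == enm_tags[i],
    }
    # Context features from the neighbouring tokens.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True      # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True      # end of sentence
    return feats


def sentence_features(tokens: List[str], la_tags: List[str],
                      enm_tags: List[str]) -> List[Dict[str, Feature]]:
    """Feature dicts for a whole sentence; all inputs must align."""
    if not (len(tokens) == len(la_tags) == len(enm_tags)):
        raise ValueError("tokens and tag sequences must have equal length")
    return [token_features(tokens, la_tags, enm_tags, i)
            for i in range(len(tokens))]


# Toy input, invented for illustration only (not taken from the corpus).
example = sentence_features(
    ["et", "þe", "kyng", "dixit"],
    ["CONJ", "UNK", "UNK", "VERB"],   # placeholder Latin-model tags
    ["UNK", "DET", "NOUN", "UNK"],    # placeholder Middle English-model tags
)
```

The same feature sequences could serve both sub-tasks, with language IDs as labels in one case and Universal POS tags in the other; whether the paper shares features in this way is not stated in the abstract.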

[1]  Tien Ping Tan,et al.  Applying Grapheme, Word, and Syllable Information for Language Identification in Code Switching Sentences , 2011, 2011 International Conference on Asian Language Processing.

[2]  C. Hume Multilingualism in Medieval Britain (c. 1066-1520): Sources and Analysis , 2013 .

[3]  David Bamman,et al.  The Ancient Greek and Latin Dependency Treebanks , 2011, Language Technology for Cultural Heritage.

[4]  Yang Liu,et al.  Part-of-Speech Tagging for English-Spanish Code-Switched Text , 2008, EMNLP.

[5]  Ad Putter,et al.  Code-Switching in Early English , 2011 .

[6]  Riyaz Ahmad Bhat,et al.  Language Identification in Code-Switching Scenario , 2014, CodeSwitch@EMNLP.

[7]  Marius L. Jøhndal,et al.  Creating a Parallel Treebank of the Old Indo-European Bible Translations , 2008 .

[8]  Sandra Kübler,et al.  Part of Speech Tagging Bilingual Speech Transcripts with Intrasentential Model Switching , 2013, AAAI Spring Symposium: Analyzing Microtext.

[9]  Joachim Wagner,et al.  Code Mixing: A Challenge for Language Identification in the Language of Social Media , 2014, CodeSwitch@EMNLP.

[10]  Carol Myers-Scotton,et al.  The matrix language frame model: Developments and responses , 2001 .

[11]  Amitava Das,et al.  Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages , 2015, RANLP.

[12]  Jonas Kuhn,et al.  ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks , 2013, ACL.

[13]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[14]  Sampo Pyysalo,et al.  Universal Dependencies v1: A Multilingual Treebank Collection , 2016, LREC.

[15]  Alexander Mehler,et al.  TLT-CRF: A Lexicon-supported Morphological Tagger for Latin Based on Conditional Random Fields , 2016, LREC.

[16]  Barbara McGillivray,et al.  The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon , 2009, TAL.

[17]  Slav Petrov,et al.  A Universal Part-of-Speech Tagset , 2011, LREC.

[18]  Dau-Cheng Lyu,et al.  Language identification on code-switching utterances using multiple cues , 2008, INTERSPEECH.

[19]  Amitava Das,et al.  Code-Mixing in Social Media Text. The Last Language Identification Frontier? , 2013, Trait. Autom. des Langues.

[20]  Helmut Schmid,et al.  Improvements in Part-of-Speech Tagging with an Application to German , 1999 .

[21]  Yang Liu,et al.  Learning to Predict Code-Switching Points , 2008, EMNLP.

[22]  Siegfried Wenzel Macaronic Sermons: Bilingualism and Preaching in Late-Medieval England , 1994 .

[23]  P. Horner,et al.  A macaronic sermon collection from late medieval England : Oxford, MS Bodley 649 , 2006 .