A Three-Step Model of Language Detection in Multilingual Ancient Texts

Ancient corpora exhibit a variety of multilingual patterns, which poses numerous problems for both their manual annotation and their automatic processing. We introduce a lexicon-building system, the Lexicon Expander, with an integrated language detection module, the Language Detection (LD) Toolkit. The Lexicon Expander post-processes the output of the LD Toolkit, which improves both F-score and accuracy. The Lexicon Expander also supports manual editing of lexical entries and their automatic morphological expansion by means of a morphological grammar.
