An open diachronic corpus of historical Spanish

The impact-es diachronic corpus of historical Spanish compiles over one hundred books, containing approximately 8 million words, together with a complementary lexicon that links more than 10,000 lemmas with attestations of the different variants found in the documents. Both the textual corpus and the accompanying lexicon have been released under an open license (Creative Commons BY-NC-SA) to permit their intensive exploitation in linguistic research. Approximately 7% of the words in the corpus (a selection aimed at enhancing the coverage of the most frequent word forms) have been annotated with their lemma, part of speech, and modern equivalent. This paper describes the annotation criteria followed and the standards, based on the Text Encoding Initiative recommendations, used to represent the texts in digital form.
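As an illustration of how token-level annotations of this kind can feed a lemma-to-variant lexicon, the following is a minimal sketch in Python. It assumes a TEI-style fragment in which each word is a `w` element carrying `lemma`, `pos`, and `norm` attributes. The sample tokens, the attribute values, and the `build_lexicon` helper are hypothetical and are not taken from the corpus files, whose actual encoding is defined in the paper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical TEI-style fragment: element and attribute names are an
# assumption for illustration, not the corpus's documented schema.
SAMPLE = """<s>
  <w lemma="hacer" pos="verb" norm="hizo">fizo</w>
  <w lemma="hacer" pos="verb" norm="hizo">hizo</w>
  <w lemma="decir" pos="verb" norm="dijo">dixo</w>
  <w lemma="decir" pos="verb" norm="dijo">dijo</w>
</s>"""


def build_lexicon(xml_text):
    """Map each lemma to the set of word forms attested for it."""
    lexicon = defaultdict(set)
    for w in ET.fromstring(xml_text).iter("w"):
        # The element text is the historical spelling found in the document.
        lexicon[w.get("lemma")].add(w.text.strip())
    return dict(lexicon)


lexicon = build_lexicon(SAMPLE)
for lemma in sorted(lexicon):
    print(lemma, sorted(lexicon[lemma]))
```

Grouping by the `lemma` attribute rather than by the modern equivalent keeps inflected forms of the same headword together, which is the kind of link between lemmas and attested variants that the abstract describes.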
