Paper information - Structured Knowledge for Low-Resource Languages: The Latin and Ancient Greek Dependency Treebanks

We describe here our work in creating treebanks – large collections of syntactically annotated data – for Latin and Ancient Greek. While the treebanks themselves present important datasets for traditional research in philology and linguistics, the layers of structured knowledge they contain (including disambiguated lemma, morphological, and syntactic information for every word) help offset the comparatively small size of extant Greek and Latin texts for text mining applications. We describe two such uses for these Classical treebanks – discovering lexical knowledge from a large corpus with the help of a small treebank, and identifying patterns of text reuse.

Gregory Crane
