Structured Knowledge for Low-Resource Languages: The Latin and Ancient Greek Dependency Treebanks

We describe our work in creating treebanks (large collections of syntactically annotated data) for Latin and Ancient Greek. While the treebanks themselves are important datasets for traditional research in philology and linguistics, the layers of structured knowledge they contain (including the disambiguated lemma, morphological analysis, and syntactic annotation for every word) help offset, for text mining applications, the comparatively small size of the extant Greek and Latin corpus. We describe two such uses for these Classical treebanks: discovering lexical knowledge from a large corpus with the help of a small treebank, and identifying patterns of text reuse.
