Frontiers in Linguistic Annotation for Lower-Density Languages

The languages most commonly subject to large-scale linguistic annotation tend to be those with the largest populations or with recent histories of linguistic scholarship. In this paper we discuss the problems associated with lower-density languages in the context of developing linguistically annotated resources. We frame our work with three key questions, concerning the definition of lower-density languages, increasing available resources, and reducing data requirements. We identify a number of steps forward for increasing the number of lower-density language corpora with linguistic annotations.
