Paper information - Development of a Guarani - Spanish Parallel Corpus

Development of a Guarani - Spanish Parallel Corpus

This paper presents the development of a Guarani-Spanish parallel corpus with sentence-level alignment. The Guarani sentences are written in Jopara, the Guarani dialect spoken in Paraguay, which is based on Guarani grammar and may include several Spanish loanwords or neologisms. The corpus contains around 14,500 sentence pairs aligned through a semi-automatic process, comprising 228,000 Guarani tokens and 336,000 Spanish tokens extracted from web sources.
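The figures in the abstract imply rough per-sentence averages, which the short sketch below works out. The counts are the approximate values quoted above; the function name `corpus_averages` and the dictionary keys are hypothetical and chosen only for illustration.

```python
# Approximate figures quoted in the abstract (rounded by the source).
SENTENCE_PAIRS = 14_500
GUARANI_TOKENS = 228_000
SPANISH_TOKENS = 336_000


def corpus_averages(pairs, gn_tokens, es_tokens):
    """Return mean tokens per sentence for each side and the Spanish/Guarani token ratio."""
    return {
        "gn_per_sentence": gn_tokens / pairs,
        "es_per_sentence": es_tokens / pairs,
        "es_to_gn_ratio": es_tokens / gn_tokens,
    }


stats = corpus_averages(SENTENCE_PAIRS, GUARANI_TOKENS, SPANISH_TOKENS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
# gn_per_sentence: 15.72
# es_per_sentence: 23.17
# es_to_gn_ratio: 1.47
```

Under these rounded counts, a Guarani sentence averages about 16 tokens against about 23 for its Spanish counterpart. A gap of this kind is plausible given Guarani's agglutinative morphology, although the abstract itself does not comment on it.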
