Paper information - Language Preservation: A Case Study in Collecting and Digitizing Machine-Tractable Language Data

In this paper we describe a process for collecting and digitizing machine-tractable resources for lesser-studied languages. We illustrate this process with examples from Guarani (an indigenous language of Paraguay), Chechen, and other languages. By ‘machine-tractable’ we mean that, in addition to being readable by people, the resource can also be processed by a computational tool. Our goal in acquiring these resources is to use them for quick ramp-up machine translation. In related work, Nirenburg et al. [2] developed an elicitation system that guided non-expert language informants through questions about the ecology, inflectional morphology, and syntax of their language, and also led them through a lexicon development task. This information was then used to automatically generate a transfer-based machine translation system. Our approach replaces this rigid, guided process with the more free-form acquisition of general resources, which experts could then use to create a machine translation system.

Ron Zacharski | Jim Cowie | Steve Helmreich

[1] Michele Banko, et al. Scaling to Very Very Large Corpora for Natural Language Disambiguation, 2001, ACL.

[2] Sergei Nirenburg, et al. Embedding Knowledge Elicitation and MT Systems within a Single Architecture, 2005, Machine Translation.