Linguistic resource creation for research and technology development: A recent experiment

Advances in statistical machine learning encourage language-independent approaches to linguistic technology development. Experiments in "porting" technologies to handle new natural languages have revealed great potential for multilingual computing, but also a frustrating lack of linguistic resources for most languages. Recent efforts to address this lack have focused either on intensive resource development for a small number of languages or on the development of technologies for rapid porting. The Linguistic Data Consortium recently participated in an experiment falling primarily under the first approach, the surprise language exercise. This article describes linguistic resource creation within this context, including the overall methodology for surveying and collecting language resources, as well as details of the resources developed during the exercise. The article concludes with a discussion of a new approach to solving the problem of limited linguistic resources, one that has recently proven effective in identifying core linguistic resources for less commonly studied languages.
