Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers

This research begins by distinguishing a small number of “central” languages from the “noncentral languages”, where centrality is measured by the extent to which a given language is supported by natural language processing tools and research. We analyse the conditions under which noncentral language projects (NCLPs) and central language projects are conducted, and establish a number of important differences that have far-reaching consequences for NCLPs. To overcome the difficulties inherent in NCLPs, traditional research strategies have to be reconsidered. Successful styles of scientific cooperation, such as those found in open-source software development or in the development of Wikipedia, provide alternative views of how NCLPs might be designed. We elaborate the concepts of free software and software pools and argue that NCLPs, in their own interests, should embrace an open-source approach for the resources they develop and pool these resources with other similar open-source resources. The expected advantages of this approach are so important that we suggest that funding organizations make it a sine qua non condition of project contracts.
