Building an open-source development infrastructure for language technology projects

This article presents the Giellatekno & Divvun language technology resources, focusing on the effort to use open-source tools to improve the build infrastructure and on solutions that help the projects adopt best practices for software development. In particular, it discusses how the infrastructure has been rebuilt to cope with an increasing number of languages without incurring extra overhead for the maintainers, while letting the linguists concentrate on the linguistic work. Finally, it discusses how a uniform infrastructure like the one presented makes it easy to compare languages in terms of morphological or computational complexity and coverage, and how it can support cross-lingual applications.
