Computational strategies for reducing annotation effort in language documentation

With the urgent need to document the world's dying languages, it is important to explore ways to speed up language documentation efforts. One promising avenue is to use techniques from computational linguistics to automate some of the process. Here we consider unsupervised morphological segmentation and active learning for creating interlinear glossed text (IGT) for the Mayan language Uspanteko. The practical goal is to produce a totally annotated corpus that is as accurate as possible given limited time for manual annotation. We discuss results from several experiments that suggest there is indeed much promise in these methods but also show that further development is necessary to make them robustly useful for a wide range of conditions and tasks. We also provide a detailed discussion of how two documentary linguists perceived machine support in IGT production and how their annotation performance varied with different levels of machine support.

LiLT Volume X, Issue Y, November 2009. Copyright © 2009, CSLI Publications.
