Enriching Interlinear Text using Automatically Constructed Annotators

In this paper, we demonstrate a system that shows promise for creating part-of-speech (POS) taggers for languages with few or no curated resources, and that requires no expert involvement. Interlinear Glossed Text (IGT) is a resource that is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more than IGT from this database and a classification-based projection approach tailored for IGT, we show that it is feasible to train reasonably well-performing annotators of interlinear text from projected annotations for potentially hundreds of the world’s languages. Doing so can facilitate the automatic enrichment of interlinear resources to aid the field of linguistics.

[1] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[2] Slav Petrov, et al. Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections, 2011, ACL.

[3] Fei Xia, et al. Multilingual Structural Projection across Interlinear Text, 2007, HLT-NAACL.

[4] Fei Xia, et al. Capturing divergence in dependency trees to improve syntactic projection, 2014, Lang. Resour. Evaluation.

[5] Emily M. Bender, et al. Learning Grammar Specifications from IGT: A Case Study of Chintang, 2014.

[6] Philip Resnik, et al. Bootstrapping parsers via syntactic projection across parallel texts, 2005, Natural Language Engineering.

[7] David Yarowsky, et al. Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora, 2001, NAACL.

[8] Chris Brew, et al. A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources, 2004, EMNLP.

[9] Robert L. Mercer, et al. Class-Based n-gram Models of Natural Language, 1992, CL.

[10] Fei Xia, et al. Automatically Identifying Computationally Relevant Typological Features, 2008, IJCNLP.

[11] Sarah L. Nesbeitt. Ethnologue: Languages of the World, 1999.

[12] Emily M. Bender, et al. Enriching ODIN, 2014, LREC.

[13] Slav Petrov, et al. A Universal Part-of-Speech Tagset, 2011, LREC.

[14] John DeNero, et al. Painless Unsupervised Learning with Features, 2010, NAACL.

[15] Emily M. Bender. Language CoLLAGE: Grammatical Description with the LinGO Grammar Matrix, 2014, LREC.

[16] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[17] Fei Xia, et al. Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages, 2010, Lit. Linguistic Comput.

[18] Emily M. Bender, et al. Towards Creating Precision Grammars from Interlinear Glossed Text: Inferring Large-Scale Typological Properties, 2013, LaTeCH@ACL.

[19] Fei Xia, et al. Enriching, Editing, and Representing Interlinear Glossed Text, 2015, CICLing.

[20] Mark Steedman, et al. Two Decades of Unsupervised POS Induction: How Far Have We Come?, 2010, EMNLP.

[21] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.