Enriching a massively multilingual database of interlinear glossed text

The majority of the world’s languages have little to no NLP resources or tools, largely because they lack the training data (“resources”) on which tools such as taggers or parsers can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swath of the world’s languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. We propose that Interlinear Glossed Text (IGT), a very common form of annotated data in the field of linguistics, has great potential for bootstrapping NLP tools for resource-poor languages. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of its content is not easily consumable by machines because it remains “trapped” in scholarly linguistic documents and in human-readable form. In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of IGT for over a thousand languages. We enrich the original IGT data by adding word alignment and syntactic structure. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we adopt and extend a new XML format for IGT, called Xigt. We also develop two packages for manipulating IGT data: INTENT, which enriches raw IGT automatically, and XigtEdit, a graphical IGT editor.
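To illustrate the kind of word alignment that an IGT instance makes possible, the following is a minimal sketch, not the method used by INTENT or ODIN. It assumes the usual three-line IGT layout (language line, gloss line, free translation), in which gloss tokens correspond one-to-one to language-line words, and it aligns a language word to a translation word whenever a piece of its gloss matches that word. The function name `align_igt`, the crude prefix-matching rule, and the German example sentence are all constructed here for illustration.

```python
import re


def align_igt(language_line, gloss_line, translation_line):
    """Align language-line words to translation words via the gloss line.

    Hypothetical helper: returns a list of (language_index, translation_index)
    pairs. Matching is deliberately naive (exact or prefix match on gloss
    pieces) and only serves to show the idea.
    """
    lang = language_line.split()
    gloss = gloss_line.split()
    # Normalise translation words: strip punctuation, lowercase.
    trans = [w.strip(".,!?;:").lower() for w in translation_line.split()]

    # In IGT the gloss line is expected to pair token-by-token with the
    # language line; refuse to guess if it does not.
    if len(lang) != len(gloss):
        raise ValueError("language and gloss lines have different lengths")

    alignments = []
    for i, gloss_token in enumerate(gloss):
        # Split a gloss such as "bark.3SG" into its pieces: ["bark", "3sg"].
        pieces = [p.lower() for p in re.split(r"[-.=]", gloss_token) if p]
        for j, word in enumerate(trans):
            if any(word == p or (len(p) >= 3 and word.startswith(p))
                   for p in pieces):
                alignments.append((i, j))
    return alignments


# Constructed example (not taken from ODIN).
pairs = align_igt("Der Hund bellt", "the.NOM dog bark.3SG", "The dog barks.")
print(pairs)
```

Because the translation line is typically in a resource-rich language, an alignment of this kind is what allows analyses of the translation to be projected back onto the language line.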
