Breaking the Resource Bottleneck for Multilingual Parsing

Abstract: We propose a framework that enables the acquisition of annotation-heavy resources such as syntactic dependency tree corpora for low-resource languages by importing linguistic annotations from high-quality English resources. We present a large-scale experiment showing that Chinese dependency trees can be induced by using an English parser, a word alignment package, and a large corpus of sentence-aligned bilingual text. As part of the experiment, we evaluate the quality of a Chinese parser trained on the induced dependency treebank. We find that a parser trained in this manner outperforms some simple baselines in spite of the noise in the induced treebank. The results suggest that projecting syntactic structures from English is a viable option for acquiring annotated syntactic structures quickly and cheaply. We expect the quality of the induced treebank to improve when more sophisticated filtering and error-correction techniques are applied.
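
To make the idea of projecting dependencies across a word alignment concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes that English dependencies are given as (head, dependent) token-index pairs and that the alignment is a list of (English index, Chinese index) links. The function name `project_dependencies` and the rule of keeping only one-to-one aligned words are illustrative assumptions; the abstract does not specify how unaligned or many-to-many cases are handled.

```python
from collections import defaultdict


def project_dependencies(en_edges, alignment):
    """Project English (head, dependent) edges onto target-side token indices.

    Assumption for this sketch: only words with exactly one alignment link
    in both directions are projected; all other edges are skipped.
    """
    en_to_zh = defaultdict(set)
    zh_to_en = defaultdict(set)
    for en_i, zh_j in alignment:
        en_to_zh[en_i].add(zh_j)
        zh_to_en[zh_j].add(en_i)

    def one_to_one(en_i):
        # Return the unique target index for en_i, or None if the link
        # is missing or not one-to-one.
        targets = en_to_zh.get(en_i, set())
        if len(targets) != 1:
            return None
        (zh_j,) = targets
        return zh_j if len(zh_to_en[zh_j]) == 1 else None

    projected = set()
    for head, dep in en_edges:
        zh_head, zh_dep = one_to_one(head), one_to_one(dep)
        if zh_head is not None and zh_dep is not None:
            projected.add((zh_head, zh_dep))
    return sorted(projected)


# Toy example: "I like apples" aligned word-for-word with "我 喜欢 苹果".
en_edges = [(1, 0), (1, 2)]              # "like" heads "I" and "apples"
alignment = [(0, 0), (1, 1), (2, 2)]     # (English index, Chinese index)
print(project_dependencies(en_edges, alignment))
```

Skipping every edge that touches a word without a clean one-to-one link trades coverage for precision, which is one simple way to limit the kind of noise the abstract mentions; a real system would need explicit handling for the remaining alignment cases.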
