Bootstrapping parsers via syntactic projection across parallel texts

Broad-coverage, high-quality parsers are available for only a handful of languages. Developing such parsers for more languages first requires annotating text with the desired linguistic representations (also known as “treebanking”). However, syntactic annotation is labor-intensive and time-consuming, and linguistically annotated text is difficult to find in sufficient quantities. In this article, we explore using parallel text to help solve the problem of creating syntactic annotation in more languages. The central idea is to annotate the English side of a parallel corpus, project the analysis to the second language, and then train a stochastic analyzer on the resulting noisy annotations. We discuss our background assumptions, describe an initial study on the “projectability” of syntactic relations, and then present two experiments in which stochastic parsers are developed with minimal human intervention via projection from English.
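
The projection step can be illustrated with a minimal sketch. It assumes that the English analysis is a dependency tree stored as a list of head indices and that a one-to-one word alignment between the two sentences is already available; neither the data layout nor the function name `project_dependencies` comes from the article, and the projection procedure in the actual experiments may differ.

```python
from typing import Dict, List, Optional


def project_dependencies(
    english_heads: List[Optional[int]],
    alignment: Dict[int, int],
    target_length: int,
) -> List[Optional[int]]:
    """Copy head-dependent links from English words onto aligned target words.

    english_heads[i] is the index of the head of English word i (None = root).
    alignment maps an English word index to a single target word index
    (one-to-one links only; this is an assumption of the sketch).
    In the result, None means "root, or no head could be projected".
    """
    target_heads: List[Optional[int]] = [None] * target_length
    for dependent, head in enumerate(english_heads):
        if head is None:
            continue  # the English root has no incoming link to project
        # Project the link only when both of its endpoints are aligned.
        if dependent in alignment and head in alignment:
            target_heads[alignment[dependent]] = alignment[head]
    return target_heads


# Hypothetical four-word English sentence aligned to a three-word target
# sentence; English word 2 has no counterpart in the target.
english_heads = [1, None, 3, 1]
alignment = {0: 0, 1: 2, 3: 1}
print(project_dependencies(english_heads, alignment, target_length=3))
```

Links whose endpoints are unaligned are simply dropped here, which is one source of the noise in the projected annotations that the stochastic analyzer is then trained on.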
