LinGO Redwoods: A Rich and Dynamic Treebank for HPSG

The LinGO Redwoods initiative is a seed activity in the design and development of a new type of treebank. A treebank is a (typically hand-built) collection of natural language utterances and associated linguistic analyses; typical treebanks, such as the widely recognized Penn Treebank (Marcus, Santorini, & Marcinkiewicz, 1993), the Prague Dependency Treebank (Hajic, 1998), or the German TiGer Corpus (Skut, Krenn, Brants, & Uszkoreit, 1997), assign syntactic phrase structure or tectogrammatical dependency trees over sentences taken from a naturally occurring source, often newspaper text. Applications of existing treebanks fall into two broad categories: (i) use of an annotated corpus in empirical linguistics as a source of structured language data and distributional patterns and (ii) use of the treebank for the acquisition (e.g. using stochastic or machine learning approaches) and evaluation of parsing systems. While several medium- to large-scale treebanks exist for English (and some for other major languages), all pre-existing publicly available resources exhibit the following limitations: (i) the depth of linguistic information recorded in these treebanks is comparatively shallow, (ii) the design and format of linguistic representation in the treebank hard-wires a small, predefined range of ways in which information can be extracted from the treebank, and (iii) representations in existing treebanks are static and, over the (often year- or decade-long) evolution of a large-scale treebank, tend to fall behind theoretical advances in formal linguistics and grammatical representation. LinGO Redwoods aims at the development of a novel treebanking methodology, (i) rich in nature and dynamic in both (ii) the ways linguistic data can be retrieved from the treebank in varying granularity and (iii) the constant evolution and regular updating of the treebank itself, synchronized to the development of ideas in syntactic theory.
Starting in October 2001, the project is aiming to build the foundations for this new type of treebank, develop a basic set of tools required for treebank construction and maintenance, and construct an initial set of 10,000 annotated trees to be distributed together with the tools under an open-source license. Building a large-scale treebank, disseminating it, and positioning the corpus as a widely accepted resource is a multi-year effort; the results of this seeding activity will serve as a proof of concept for the novel approach, which is expected to enable the LinGO group at CSLI both to disseminate the approach to a wider academic and industrial audience and to secure appropriate funding for the realization and exploitation of a larger treebank. The purpose of publication at this early stage is three-fold: (i) to encourage feedback on the Redwoods approach from a broader academic audience, (ii) to facilitate exchange with related work at other sites, and (iii) to invite additional collaborators to contribute to the construction of the Redwoods treebank or start its exploitation as early-access versions become available. This paper is an updated version of an earlier project report published by Oepen, Callahan, Flickinger, and Manning (2002); changes over that version include more recent numbers on the current Redwoods development status, inclusion of an example of discriminator-based disambiguation, and minor adaptations and corrections in various parts of the discussion.

[1]  T. E. Harris,et al.  The Theory of Branching Processes , 1963 .

[2]  A. Agresti An introduction to categorical data analysis , 1997 .

[3]  Ann A. Copestake,et al.  The ACQUILEX LKB: representation issues in semi-automatic acquisition of large lexicons , 1992, ANLP.

[4]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[5]  Ivan A. Sag,et al.  Book Reviews: Head-driven Phrase Structure Grammar and German in Head-driven Phrase-structure Grammar , 1996, CL.

[6]  Eric Atwell Comparative evaluation of grammatical annotation models , 1996 .

[7]  Eugene Charniak,et al.  Statistical Parsing with a Context-Free Grammar and Word Statistics , 1997, AAAI/IAAI.

[8]  Bob Carpenter,et al.  Probabilistic Parsing using Left Corner Language Models , 1997, IWPT.

[9]  Wojciech Skut,et al.  An Annotation Scheme for Free Word Order Languages , 1997, ANLP.

[10]  Michael Collins,et al.  Three Generative, Lexicalised Models for Statistical Parsing , 1997, ACL.

[11]  David M. Carter,et al.  The TreeBanker: a Tool for Supervised Training of Parsed Corpora , 1997, ArXiv.

[12]  Ted Briscoe,et al.  Parser evaluation: a survey and a new proposal , 1998, LREC.

[13]  Stephan Oepen,et al.  The (new) LKB system , 1999 .

[14]  Mark Johnson,et al.  Estimators for Stochastic “Unification-Based” Grammars , 1999, ACL.

[15]  Ulrich Callmeier,et al.  PET – a platform for experimentation with efficient HPSG processing techniques , 2000, Natural Language Engineering.

[16]  Wolfgang Wahlster,et al.  Verbmobil: Foundations of Speech-to-Speech Translation , 2000, Artificial Intelligence.

[17]  Gertjan van Noord,et al.  Alpino: Wide-coverage Computational Analysis of Dutch , 2000, CLIN.

[18]  Stephan Oepen,et al.  Measure for Measure: Parser Cross-fertilization - Towards Increased Component Comparability and Exchange , 2000, IWPT.

[19]  Stefanie Dipper Grammar-Based Corpus Annotation , 2000, COLING 2000.

[20]  Dan Flickinger,et al.  On building a more efficient grammar by exploiting types , 2000, Natural Language Engineering.

[21]  Jonas Kuhn,et al.  Ambiguity Management in Grammar Writing , 2004 .

[22]  Gertjan van Noord,et al.  Statistical Parsing of Dutch using Maximum Entropy Models with Feature Merging , 2001, NLPRS.

[23]  Alex Lascarides,et al.  An Algebra for Semantic Construction in Constraint-based Grammars , 2001, ACL.