Completing Parsed Corpora

This paper describes the process of corpus annotation employed on ICE-GB, a one-million-word parsed corpus of spoken and written English. Limitations of existing automatic annotation tools mean that, for the foreseeable future, completion of corpora will require manual intervention and correction. However, manual post-correction is labour-intensive and error-prone, and requires a very high degree of skill. In an earlier paper (Wallis and Nelson, 1997), we demonstrated that this problem far outweighs issues of tool design. We therefore proposed a paradigm shift from text-wise (longitudinal) to construction-based (transverse) correction. This permits the delicate problem of identifying and correcting errors in parses to be located within a context of similar problems across the corpus. Without removing the role of human judgment, this approach reduced the difficulty of the task and increased the consistency of intervention. We discuss the technological requirements for this approach, and its advantages and limitations.

[1] Mark Huckvale, et al. Out-of-vocabulary rate reduction through dispersion-based lexicon acquisition, 2000.

[2] Sean Wallis, et al. Knowledge Discovery in Grammatically Analysed Corpora, 2001, Data Mining and Knowledge Discovery.

[3] Bas Aarts, et al. Exploring Natural Language: Working with the British Component of the International Corpus of English, 2002.

[4] Sidney Greenbaum, et al. The Oxford English Grammar, 1996.

[5] Nelleke Oostdijk, et al. Corpus Linguistics and the Automatic Analysis of English, 1991.

[6] Jan Hajic, et al. The Prague Dependency Treebank, 2003.

[7] John Sinclair, et al. The automatic analysis of corpora, 1992.

[8] Claus Gnutzmann, et al. Teaching and learning English as a global language: native and non-native perspectives, 1999.

[9] Beatrice Santorini, et al. The Penn Treebank: An Overview, 2003.

[10] Sean Wallis, et al. Exploiting fuzzy tree fragment queries in the investigation of parsed corpora, 2000.

[11] Jan Svartvik, et al. A comprehensive grammar of the English language, 1988.

[12] Sidney Greenbaum, et al. Comparing English worldwide: the International Corpus of English, 1996.

[13] David M. Carter, et al. The TreeBanker: a Tool for Supervised Training of Parsed Corpora, 1997, ArXiv.

[14] Bas Aarts, et al. Global resources for a global language: English language pedagogy in the modern age, 1999.

[15] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.

[16] Bas Aarts, et al. Using fuzzy tree fragments to explore English grammar, 1998.

[17] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[18] Stig Johansson, et al. English computer corpora: selected papers and research guide, 1991.

[19] Sean Wallis, et al. Syntactic Parsing as a Knowledge Acquisition Problem, 1997, EKAW.