Syntactic Chunking Across Different Corpora

Syntactic chunking has been a well-defined and well-studied task since its introduction in 2000 as the CoNLL shared task. Although further effort has gone into improving chunking performance, the experimental data has been restricted, with few exceptions, to (part of) the Wall Street Journal data adopted in the shared task. It remains open how these successful chunking technologies extend to other data, which may differ in genre or domain and in the amount of annotation available. In this paper we first train chunkers with three classifiers on three different data sets and test them on four data sets. We also vary the size of the training data systematically to show the data requirements of chunkers. Three findings emerge. First, there is no significant difference between the state-of-the-art classifiers. Second, training on plentiful data from the same corpus (Switchboard) yields results comparable to those of Wall Street Journal chunkers, even when the underlying material is spoken. Third, the results obtained from a large amount of unmatched training data can be matched by using a very modest amount of matched training data.
