Subdomain Sensitive Statistical Parsing using Raw Corpora

Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g., sports, politics, music), so the statistics gathered by current statistical parsers are mixtures over subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques to create an ensemble of domain-sensitive parsers and explore methods for combining their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a strong, state-of-the-art baseline.
