Generalizing a Strongly Lexicalized Parser using Unlabeled Data

Statistical parsers trained on labeled data suffer from sparsity, both grammatical and lexical. For parsers based on strongly lexicalized grammar formalisms (such as CCG, which has complex lexical categories but simple combinatory rules), the problem of sparsity can be isolated to the lexicon. In this paper, we show that semi-supervised Viterbi-EM can be used to extend the lexicon of a generative CCG parser. By learning complex lexical entries for low-frequency and unseen words from unlabeled data, we obtain improvements over our supervised model for both in-domain (WSJ) and out-of-domain (questions and Wikipedia) data. When used with a discriminative parser such as C&C, our learnt lexicons also significantly improve its performance on unseen words.
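
The general idea of semi-supervised Viterbi-EM for lexicon extension can be illustrated with a minimal sketch. It is not the paper's model: it assumes a toy lexicon-only model P(category | word), a hypothetical list of candidate derivations per unlabeled sentence (as a parser might propose), and a fixed floor probability for unseen word-category pairs. All function names, categories, and counts below are invented for illustration.

```python
import math
from collections import Counter

def normalize(counts):
    """Turn (word, category) counts into a toy P(category | word)."""
    totals = Counter()
    for (word, _cat), c in counts.items():
        totals[word] += c
    return {(w, cat): c / totals[w] for (w, cat), c in counts.items()}

def score(derivation, probs, floor=1e-3):
    """Log-score of a derivation; unseen pairs get an assumed floor probability."""
    return sum(math.log(probs.get(pair, floor)) for pair in derivation)

def viterbi_em(labeled_counts, unlabeled, iterations=5, weight=1.0):
    """Hard EM: keep only the single best derivation of each unlabeled sentence."""
    probs = normalize(labeled_counts)
    for _ in range(iterations):
        # Start each iteration from the supervised counts (semi-supervised setting).
        counts = Counter(labeled_counts)
        for candidates in unlabeled:
            # "E-step": pick the Viterbi (highest-scoring) derivation.
            best = max(candidates, key=lambda d: score(d, probs))
            for pair in best:
                counts[pair] += weight
        # "M-step": re-estimate the lexicon from the combined counts.
        probs = normalize(counts)
    return probs

# Hypothetical supervised counts; "spotted" is unseen in the labeled data.
labeled_counts = {("John", "NP"): 3, ("the", "NP/N"): 10, ("dog", "N"): 4}

# One unlabeled sentence, "John spotted the dog", with two candidate derivations.
unlabeled = [[
    [("John", "NP"), ("spotted", "(S\\NP)/NP"), ("the", "NP/N"), ("dog", "N")],
    [("John", "NP"), ("spotted", "(S\\NP)/N"), ("the", "N/N"), ("dog", "N")],
]]

lexicon = viterbi_em(labeled_counts, unlabeled)
print(sorted(cat for (word, cat) in lexicon if word == "spotted"))
```

In this toy setting the first derivation wins because it reuses lexical entries already attested in the labeled counts, so the unseen word "spotted" acquires the transitive-verb category. A real generative CCG parser would score full derivations rather than lexical entries alone.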
