Web-scale Surface and Syntactic n-gram Features for Dependency Parsing

We develop novel first- and second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts from parsed sentences in scanned books. We also extend previous work on surface $n$-gram features from Web1T to the Google Books corpus and from first-order to second-order features, comparing and analysing performance on newswire and web treebanks. Surface and syntactic $n$-grams both produce substantial, complementary gains in parsing accuracy across domains. Our best system combines the two feature sets, achieving absolute improvements in unlabeled attachment score (UAS) of up to 0.8% on newswire and 1.4% on web text.
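To make the idea of a count-based first-order feature concrete, the following is a minimal sketch, not the feature templates used in this work. It assumes a lookup table of head-modifier arc counts (the `ARC_COUNTS` values are invented toy numbers, and `count_bucket` and `first_order_feature` are hypothetical helper names) and turns each raw count into a coarse log-scale bucket that a parser could use as a discrete feature.

```python
import math
from collections import Counter

# Hypothetical toy counts of (head, modifier) arcs.
# These numbers are invented for illustration, not taken from any corpus.
ARC_COUNTS = Counter({
    ("ate", "pizza"): 1200,
    ("ate", "fork"): 15,
})


def count_bucket(count: int) -> int:
    """Map a raw count to a log10-scale bucket; unseen arcs get bucket 0."""
    return 0 if count <= 0 else 1 + int(math.log10(count))


def first_order_feature(head: str, modifier: str) -> str:
    """Return a discrete feature string for one candidate head-modifier arc."""
    count = ARC_COUNTS[(head, modifier)]  # Counter returns 0 for missing keys
    return f"syn_arc_bucket={count_bucket(count)}"


if __name__ == "__main__":
    for arc in [("ate", "pizza"), ("ate", "fork"), ("pizza", "fork")]:
        print(arc, first_order_feature(*arc))
```

Bucketing on a log scale is one common way to keep such features sparse and robust to the heavy-tailed distribution of corpus counts; a second-order variant would, under the same assumptions, key the table on a triple such as head, modifier and sibling instead of a pair.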
