Using Web-scale N-grams to Improve Base NP Parsing Performance

We use web-scale N-grams in a base NP parser that correctly analyzes 95.4% of the base NPs in natural text. Web-scale data improves performance: there is no data like more data. Performance scales log-linearly with the number of parameters in the model, that is, the number of unique N-grams. The web-scale N-grams are particularly helpful in harder cases, such as NPs that contain conjunctions.
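
To make the log-linear scaling claim concrete, the sketch below fits accuracy = a + b * log10(N) by ordinary least squares, where N is the number of unique N-grams. The function name `fit_log_linear` and the data points are illustrative assumptions, not figures from the paper; only the functional form comes from the abstract.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size).

    Returns (a, b): the intercept and the gain per tenfold increase in size.
    """
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical placeholder values, NOT results reported in the paper.
sizes = [1e6, 1e7, 1e8, 1e9]       # number of unique N-grams
scores = [90.0, 91.0, 92.0, 93.0]  # accuracy in percent

a, b = fit_log_linear(sizes, scores)
print(f"accuracy ~ {a:.2f} + {b:.2f} * log10(N)")
```

Under this form, each tenfold increase in the number of unique N-grams adds a roughly constant b points of accuracy, which is what "scales log-linearly" means here.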
