Large-Scale Supervised Models for Noun Phrase Bracketing

Interpreting the structure of noun phrases (NPs) is important for many Natural Language Processing (NLP) tasks. This work extends the state of the art in NP bracketing by: creating supervised models trained on a large annotated corpus; applying these models to longer, more complex NPs; and using the resulting system to improve the output of the Bikel (2004) parser. Using a large corpus of manually annotated Penn Treebank NPs, we have developed a supervised model that brackets simple NPs with 93.01% F-score. We extend the evaluation to include longer, more complex NPs that are rarely dealt with in the literature, attaining 91.44% F-score. Finally, we implement a post-processing module that brackets NPs identified by the Bikel (2004) parser; this module outperforms the parser itself by 8.13% F-score.

[1] Mitchell P. Marcus, et al. A theory of syntactic recognition for natural language, 1979.

[2] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[3] Usama M. Fayyad, et al. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, 1993, IJCAI.

[4] James Pustejovsky, et al. Lexical Semantic Techniques for Corpus Analysis, 1993, CL.

[5] P. Resnik. Selection and information: a class-based approach to lexical relationships, 1993.

[6] Mitchell P. Marcus, et al. Text Chunking using Transformation-Based Learning, 1995, VLC@ACL.

[7] Ken Barker, et al. A Trainable Bracketer for Noun Modifiers, 1998, Canadian Conference on AI.

[8] Christiane Fellbaum, et al. Book Reviews: WordNet: An Electronic Lexical Database, 1999, CL.

[9] Michael Collins, et al. Head-Driven Statistical Models for Natural Language Parsing, 2003, CL.

[10] Mitchell P. Marcus, et al. On the parameter space of generative lexicalized statistical parsing models, 2004.

[11] Frank Keller, et al. The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks, 2004, NAACL.

[12] Daniel Marcu, et al. NP Bracketing by Maximum Entropy Tagging and SVM Reranking, 2004, EMNLP.

[13] Preslav Nakov, et al. A study of using search engine page hits as a proxy for n-gram frequencies, 2005.

[14] Dan I. Moldovan, et al. On the semantics of noun compounds, 2005, Comput. Speech Lang.

[15] Preslav Nakov, et al. Search Engine Statistics Beyond the n-Gram: Application to Noun Compound Bracketing, 2005, CoNLL.

[16] James R. Curran, et al. Adding Noun Phrase Structure to the Penn Treebank, 2007, ACL.

[17] Hal Daumé. Notes on CG and LM-BFGS Optimization of Logistic Regression, 2008.