Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment

This paper explores the task of building an accurate prepositional phrase attachment corpus for new genres while avoiding a large investment in terms of time and money by crowd-sourcing judgments. We develop and present a system to extract prepositional phrases and their potential attachments from ungrammatical and informal sentences and pose the subsequent disambiguation tasks as multiple choice questions to workers from Amazon's Mechanical Turk service. Our analysis shows that this two-step approach is capable of producing reliable annotations on informal and potentially noisy blog text, and this semi-automated strategy holds promise for similar annotation projects in new genres.

[1]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Prepositional Phrase Attachment , 1994, HLT.

[2]  Xavier Carreras,et al.  Non-Projective Parsing for Statistical Machine Translation , 2009, EMNLP.

[3]  Brendan T. O'Connor,et al.  Cheap and Fast – But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks , 2008, EMNLP.

[4]  Joseph Kaye,et al.  Understanding how bloggers feel: recognizing affect in blog posts , 2006, CHI Extended Abstracts.

[5]  Ian McGraw,et al.  A self-transcribing speech corpus: collecting continuous speech with an online educational game , 2009, SLaTE.

[6]  Walter Daelemans,et al.  Resolving PP attachment Ambiguities with Memory-Based Learning , 1997, CoNLL.

[7]  Alexander S. Yeh,et al.  Some Properties of Preposition and Subordinate Conjunction Attachments , 1998, COLING-ACL.

[8]  John Blitzer,et al.  Frustratingly Hard Domain Adaptation for Dependency Parsing , 2007, EMNLP.

[9]  Kevin Knight,et al.  Synchronous Tree Adjoining Machine Translation , 2009, EMNLP.

[10]  Jacob Andreas,et al.  Towards Semi-Automated Annotation for Prepositional Phrase Attachment , 2010, LREC.

[11]  Ramanathan V. Guha,et al.  Information diffusion through blogspace , 2004, WWW '04.

[12]  Frederick Jelinek,et al.  Structured Language Modeling for Speech Recognition , 2000, ArXiv.

[13]  Brian D. Davison,et al.  A classification-based approach to question answering in discussion boards , 2009, SIGIR.

[14]  Makoto Nagao,et al.  Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary , 1997, VLC.

[15]  Brian Roark,et al.  Discriminative Syntactic Language Modeling for Speech Recognition , 2005, ACL.