Crowdsourcing the acquisition of natural language corpora: Methods and observations

We study the use of crowdsourcing methods to acquire language corpora for natural language processing systems. Specifically, we empirically investigate three methods for eliciting natural language sentences that correspond to a given semantic form. The methods convey frame semantics to crowd workers by means of sentences, scenarios, and list-based descriptions. We discuss several performance measures of the crowdsourcing process and analyze the semantic correctness, naturalness, and biases of the collected language. We highlight research challenges and directions in applying these methods to acquire corpora for natural language processing applications.
