Growing a Spoken Language Interface on Amazon Mechanical Turk

Typically, data collection, transcription, language model generation, and deployment are separate phases of creating a spoken language interface. An unfortunate consequence of this is that the recognizer usually remains a static element of systems that are often deployed in dynamic environments. By providing an API for human intelligence, Amazon Mechanical Turk changes the way system developers can construct spoken language systems. In this work, we describe an architecture that automates and connects these four phases, effectively allowing the developer to grow a spoken language interface. In particular, we show that a human-in-the-loop programming paradigm, in which workers transcribe utterances behind the scenes, can alleviate the need for expert guidance in language model construction. We demonstrate the utility of these organic language models in a voice-search interface for photographs.
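The abstract describes a closed loop in which deployment, data collection, crowd transcription, and language model retraining feed one another. The sketch below illustrates one way such a loop might be wired up; it is not the authors' implementation, and every name in it (the recognizer object, the Mechanical Turk transcriber wrapper, the build_lm callable, the reward amount) is a hypothetical placeholder.

```python
# A minimal sketch (assumed, not the authors' code) of the collect -> transcribe ->
# retrain -> redeploy loop: utterances from the live interface are posted as
# Mechanical Turk transcription HITs, finished transcripts grow the corpus, and
# the language model is rebuilt and hot-swapped into the running recognizer.

import time
from typing import Callable, List


def grow_language_model(
    recognizer,                              # deployed recognizer: yields audio URLs, accepts new LMs
    transcriber,                             # wrapper that posts and harvests MTurk transcription HITs
    build_lm: Callable[[List[str]], object], # e.g. an n-gram language model trainer
    poll_seconds: int = 300,
) -> None:
    """Continuously grow the recognizer's language model from live usage."""
    corpus: List[str] = []
    while True:
        # 1. Collect: utterances logged by the deployed spoken language interface.
        for clip_url in recognizer.fetch_unlabeled_audio():
            # 2. Transcribe: each utterance becomes a Mechanical Turk HIT.
            transcriber.post_hit(audio_url=clip_url, reward_usd=0.05)

        # 3. Harvest transcripts workers have submitted since the last poll.
        corpus.extend(transcriber.fetch_completed_transcripts())

        # 4. Retrain and redeploy: rebuild the language model from the growing
        #    corpus and swap it into the running recognizer.
        recognizer.swap_language_model(build_lm(corpus))

        time.sleep(poll_seconds)
```

Because transcription runs behind the scenes, the loop needs no expert intervention: the language model simply tracks whatever users actually say to the deployed interface.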
