Spoken Dialogue in Virtual Worlds

Human-computer conversation has attracted considerable interest, particularly in virtual worlds. Advances in speech recognition, language understanding, and speech synthesis have given rise to spoken dialogue systems. This work surveys the state of the art in spoken dialogue systems. Current dialogue system technologies and approaches are introduced first, emphasizing the differences between them; speech recognition, speech synthesis, and language understanding are then presented as complementary and necessary modules. Moreover, as the development of spoken dialogue systems becomes more complex, well-defined processes are needed to evaluate their performance. Wizard-of-Oz techniques play an important role in this task: they yield a suitable dialogue corpus, which is necessary for achieving good performance. This work describes the Wizard-of-Oz technique and closes with perspectives on multimodal dialogue systems in virtual worlds.
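The modules named above are conventionally chained into a turn-taking pipeline: speech recognition maps audio to words, language understanding maps words to a concept frame, a dialogue manager decides the next system action, and generation plus synthesis produce the reply. The sketch below illustrates one such turn with illustrative stubs; every function name and slot here is a hypothetical placeholder, not the API of any toolkit mentioned in the survey.

```python
def recognize(audio: bytes) -> str:
    """Stub ASR: map an audio buffer to a word string."""
    return "book a room for tomorrow"

def understand(utterance: str) -> dict:
    """Stub understanding: extract a frame of concept-value slots."""
    frame = {"intent": "booking"}
    if "tomorrow" in utterance:
        frame["date"] = "tomorrow"
    return frame

def manage(frame: dict, state: dict) -> str:
    """Stub form-based dialogue manager: ask for missing slots, else confirm."""
    state.update(frame)
    for slot in ("date", "room_type"):
        if slot not in state:
            return f"ask_{slot}"
    return "confirm"

def generate(action: str) -> str:
    """Stub generation: map a dialogue act to a system utterance."""
    templates = {
        "ask_date": "For which date?",
        "ask_room_type": "What kind of room would you like?",
        "confirm": "Your booking is confirmed.",
    }
    return templates[action]

# One turn through the pipeline (synthesis omitted: it would render text to audio).
state = {}
words = recognize(b"...")
action = manage(understand(words), state)
reply = generate(action)
print(reply)
```

In a Wizard-of-Oz study, a human operator would silently stand in for one or more of these stubs (typically `understand` and `manage`) while users believe they are talking to the system, and the logged exchanges form the dialogue corpus the survey describes.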
