Robots that can hear, understand and talk

In this survey paper we examine the state of the art in speech and natural language processing technologies, together with one of their most promising applications in robotics: a user interface that enables human-robot interaction and robot control through spoken natural language. Theoretical aspects of spoken language technology and the main bottlenecks in developing a conversational interface for a robot are presented in depth, drawing on the major breakthroughs reported in the literature. We give a brief technical introduction to talking robots and discuss the technical approaches currently in use as well as the challenges that lie ahead. We also highlight the limitations and missing directions in spoken language research and development that are hindering the deployment of voice-controlled robots in real-world applications.

[1]  D. Massaro Perceiving talking faces: from speech perception to a behavioral principle , 1999 .

[2]  Arturo Espinosa-Romero,et al.  Talking to Godot: dialogue with a mobile robot , 2002, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[3]  Shinichi Hirai,et al.  Realization of safety in a coexistent robotic system by information sharing , 1998, Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No.98CH36146).

[4]  Mark C. Torrance,et al.  Natural communication with robots , 1994 .

[5]  Rainer Bischoff,et al.  Natural Communication and Interaction with Humanoid Robots , 1999 .

[6]  C. Y. Thielman,et al.  Natural Language with Integrated Deictic and Graphic Gestures , 1989, HLT.

[7]  Yuichiro Anzai Human-robot-computer interaction: a new paradigm of research in robotics , 1993, Adv. Robotics.

[8]  A.-J. Baerveldt Cooperation between man and robot: interface and safety , 1992, [1992] Proceedings IEEE International Workshop on Robot and Human Communication.

[9]  Joshua G. Hale,et al.  Using Humanoid Robots to Study Human Behavior , 2000, IEEE Intell. Syst..

[10]  Richard M. Schwartz,et al.  The N-Best Algorithm: Efficient Procedure for Finding Top N Sentence Hypotheses , 1989, HLT.

[11]  Mikio Nakano,et al.  Understanding Unsegmented User Utterances in Real-Time Spoken Dialogue Systems , 1999, ACL.

[12]  David L. Thomson,et al.  User Confusion in Natural Language Services , 2000 .

[13]  Illah R. Nourbakhsh,et al.  The role of expressiveness and attention in human-robot interaction , 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292).

[14]  Robert Dale,et al.  Building Natural Language Generation Systems: Figures , 2000 .

[15]  Satoshi Nakamura,et al.  Distant-talking speech recognition based on a 3-D Viterbi search using a microphone array , 2002, IEEE Trans. Speech Audio Process..

[16]  Giridharan Iyengar,et al.  Large-vocabulary audio-visual speech recognition by machines and humans , 2001, INTERSPEECH.

[17]  Wayne H. Ward,et al.  A word graph interface for a flexible concept based speech understanding framework , 2001, INTERSPEECH.

[18]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[19]  Stephanie Seneff,et al.  TINA: A Natural Language System for Spoken Language Applications , 1992, Comput. Linguistics.

[20]  C. A. Ferguson,et al.  Talking to Children: Language Input and Acquisition , 1979 .

[21]  Kristinn R. Thórisson,et al.  Mind Model for Multimodal Communicative Creatures and Humanoids , 1999, Appl. Artif. Intell..

[22]  Cynthia Breazeal,et al.  Regulation and Entrainment in Human—Robot Interaction , 2000, Int. J. Robotics Res..

[23]  Patrick Suppes,et al.  Language and Learning for Robots , 1994 .

[24]  Kiyohiro Shikano,et al.  Interface for Barge-in Free Spoken Dialogue System Using Sound Field Control and Microphone Array , 2002 .

[25]  Kerstin Dautenhahn,et al.  The Art of Designing Socially Intelligent Agents: Science, Fiction, and the Human in the Loop , 1998, Appl. Artif. Intell..

[26]  Sadaoki Furui,et al.  Toward flexible speech recognition-recent progress at Tokyo Institute of Technology , 2001, Canadian Conference on Electrical and Computer Engineering 2001. Conference Proceedings (Cat. No.01TH8555).

[27]  Roberto Pieraccini,et al.  Using Markov decision process for learning dialogue strategies , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[28]  Daniel W. E. Schobben Real-time Adaptive Concepts in Acoustics , 2001 .

[29]  H. McGurk,et al.  Hearing lips and seeing voices , 1976, Nature.

[30]  Mark Steedman Speech, Place, and Action , 1982 .

[31]  Richard Wright,et al.  Prosody and phonetic variability: Lessons learned from acoustic model clustering , 2003 .

[32]  Kiyohiro Shikano,et al.  Julius - an open source real-time large vocabulary recognition engine , 2001, INTERSPEECH.

[33]  Richard Lippmann,et al.  Speech recognition by machines and humans , 1997, Speech Commun..

[34]  David Stallard,et al.  Syntactic and Semantic Knowledge in the DELPHI Unification Grammar , 1990, HLT.

[35]  Wayne H. Ward,et al.  The CMU Air Travel Information Service: Understanding Spontaneous Speech , 1990, HLT.

[36]  James Trevelyan,et al.  Redefining Robotics for the New Millennium , 1999, Int. J. Robotics Res..

[37]  Andreas Stolcke,et al.  Multispeaker speech activity detection for the ICSI meeting recorder , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[38]  Kiyohiro Shikano,et al.  Problems in Blind Separation of Convolutive Speech Mixtures by Negentropy Maximization , 2003 .

[39]  Karen A. Frenkel,et al.  Robots, machines in man's image , 1985 .

[40]  Victor Zue,et al.  Conversational interfaces: advances and challenges , 1997, Proceedings of the IEEE.

[41]  Dolores Cañamero,et al.  Modeling motivations and emotions as a basis for intelligent behavior , 1997, AGENTS '97.

[42]  Yasushi Nakauchi,et al.  A Social Robot that Stands in Line , 2000, Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000) (Cat. No.00CH37113).

[43]  Ben J. A. Kröse,et al.  Jijo-2: An Office Robot that Communicates and Learns , 2001, IEEE Intell. Syst..

[44]  Norman I. Badler,et al.  Simulating humans: computer graphics animation and control , 1993 .

[45]  Jean-Louis Deneubourg,et al.  The dynamics of collective sorting robot-like ants and ant-like robots , 1991 .

[46]  David B. Levine,et al.  The Khepera robot and the kRobot class: a platform for introducing robotics in the undergraduate curriculum , 2001, SIGCSE '01.

[47]  Richard Sproat Multilingual text analysis for text-to-speech synthesis , 1996, Nat. Lang. Eng..

[48]  N. Badler,et al.  Linguistic Issues in Facial Animation , 1991 .

[49]  Johanna D. Moore,et al.  Planning Text for Advisory Dialogues: Capturing Intentional and Rhetorical Information , 1993, CL.

[50]  Guy J. Brown,et al.  A comparison of auditory and blind separation techniques for speech segregation , 2001, IEEE Trans. Speech Audio Process..

[51]  Don H. Johnson,et al.  Array Signal Processing: Concepts and Techniques , 1993 .

[52]  Stanley Peters,et al.  A multi-modal dialogue system for human-robot conversation , 2001, HTL 2001.

[53]  Jay G. Wilpon,et al.  A study of speech recognition for children and the elderly , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[54]  Alexander H. Waibel,et al.  The effects of room acoustics on MFCC speech parameter , 2000, INTERSPEECH.

[55]  Kiyohiro Shikano,et al.  Interface for barge-in free spoken dialogue system based on sound field control and microphone array , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[56]  Robert Malone,et al.  The Robot Book , 1978 .

[57]  Tsutomu Miyasato,et al.  Physical Constraints on Human Robot Interaction , 1999, IJCAI.

[58]  Gerhard Lakemeyer,et al.  A Speech Interface for a Mobile Robot controlled by GOLOG , 2000 .

[59]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[60]  Takanori Shibata,et al.  Physical and affective interaction between human and mental commit robot , 2001, Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No.01CH37164).

[61]  Cynthia Breazeal Emotive qualities in robot speech , 2001, Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No.01CH37180).

[62]  Marc Schröder,et al.  Emotional speech synthesis: a review , 2001, INTERSPEECH.

[63]  Niels Ole Bernsen,et al.  Cooperativity in human‐machine and human‐human spoken dialogue , 1996 .

[64]  David Nunan,et al.  Introducing discourse analysis , 1993 .

[65]  W. S. Reilly,et al.  Believable Social and Emotional Agents. , 1996 .

[66]  Sohail Inayatullah,et al.  The rights of robots: Technology, culture and law in the 21st century , 1988 .

[67]  Kazuhiko Kawamura,et al.  Multi-agent system for a human-friendly robot , 1999, IEEE SMC'99 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.99CH37028).

[68]  Tomoki Toda,et al.  High quality voice conversion based on Gaussian mixture model with dynamic frequency warping , 2001, INTERSPEECH.

[69]  Kerstin Dautenhahn,et al.  ROBOTS AS SOCIAL ACTORS: AURORA AND THE CASE OF AUTISM , 1999 .

[70]  Yoshinori Kuno,et al.  Human-robot interface based on the mutual assistance between speech and vision , 2001, PUI '01.

[71]  Yifan Gong,et al.  Speech recognition in noisy environments: A survey , 1995, Speech Commun..

[72]  Julia Hirschberg,et al.  Progress in speech synthesis , 1997 .

[73]  Thomas S. Huang,et al.  BattleView: A Multimodal HCI Research Application , 1998 .

[74]  Mari Ostendorf,et al.  The impact of speech recognition on speech synthesis , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[75]  Atsuo Takanishi,et al.  Mechanical design of a talking robot for natural vowels and consonant sounds , 2001, Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No.01CH37164).

[76]  Tsutomu Miyasato,et al.  Multimodal human emotion/expression recognition , 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[77]  Martin J. Russell,et al.  Why is automatic recognition of children's speech difficult? , 2001, INTERSPEECH.

[78]  Kiyohiro Shikano,et al.  ASKA: receptionist robot with speech dialogue system , 2002, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[79]  Wolfgang Wahlster,et al.  KANTRA - A Natural Language Interface for Intelligent Robots , 2003 .

[80]  Jean-Louis Deneubourg,et al.  From local actions to global tasks: stigmergy and collective robotics , 2000 .

[81]  Ehud Reiter,et al.  Book Reviews: Building Natural Language Generation Systems , 2000, CL.

[82]  E. C. Cherry Some Experiments on the Recognition of Speech, with One and with Two Ears , 1953 .

[83]  Douglas E. Appelt,et al.  GEMINI: A Natural Language System for Spoken-Language Understanding , 1993, ACL.

[84]  Alex Waibel,et al.  Prosody and speech recognition , 1988 .

[85]  Yannis Stylianou,et al.  A system for voice conversion based on probabilistic classification and a harmonic plus noise model , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[86]  Jon Rigelsford,et al.  Behaviour‐based Robotics , 2001 .