Spoken Language Interaction with Robots: Research Issues and Recommendations, Report from the NSF Future Directions Workshop

Abstract With robotics rapidly advancing, more effective human–robot interaction is increasingly needed to realize the full potential of robots for society. While spoken language must be part of the solution, our ability to provide spoken language interaction capabilities is still very limited. In this article, based on the report of an interdisciplinary workshop convened by the National Science Foundation, we identify key scientific and engineering advances needed to enable effective spoken language interaction with robots. We make 25 recommendations, organized around eight general themes: putting human needs first, better modeling the social and interactive aspects of language, improving robustness, creating new methods for rapid adaptation, better integrating speech and language with other communication modalities, giving speech and language components access to rich representations of the robot's current knowledge and state, making all components operate in real time, and improving research infrastructure and resources. Research and development that prioritizes these topics will, we believe, provide a solid foundation for the creation of speech-capable robots that are easy and effective for humans to work with.
