Augmenting Conversational Agents with Ambient Acoustic Contexts

Today's conversational agents are rich in content, yet they remain oblivious to users' situational context, which limits their ability to adapt their responses and interaction style. To address this, we explore the design space for a context-augmented conversational agent, including an analysis of input segment dynamics and computational alternatives. Building on this analysis, we propose a solution that redesigns the input segment for ambient context recognition using a two-step inference pipeline: we first separate the non-speech segment from the acoustic signal and then use a neural network to infer diverse ambient contexts. To build the network, we curated a public audio dataset through crowdsourcing. Our experimental results demonstrate that the proposed network can distinguish between 9 ambient contexts with an average F1 score of 0.80 and a computational latency of 3 milliseconds. We also build a compressed neural network for on-device processing, optimised for both accuracy and latency. Finally, we present a concrete manifestation of our solution by designing a context-aware conversational agent and demonstrating use cases.
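The two-step pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation. The energy threshold stands in for a real speech activity detector, and `toy_classifier` with its labels ("quiet", "noisy") is a hypothetical placeholder for the trained neural network. All function names and parameters here are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]


def split_frames(samples: Sequence[float], frame_len: int) -> List[Frame]:
    """Cut the signal into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def select_non_speech(frames: List[Frame], energy_threshold: float) -> List[Frame]:
    """Step 1 (assumption): treat low-energy frames as non-speech.

    A real system would use a proper speech activity detector; a plain
    energy threshold is only a stand-in and fails in loud environments.
    """
    return [f for f in frames if frame_energy(f) < energy_threshold]


def toy_classifier(frames: List[Frame]) -> str:
    """Step 2 placeholder (hypothetical): a trained network would go here."""
    mean_energy = sum(frame_energy(f) for f in frames) / len(frames)
    return "quiet" if mean_energy < 0.001 else "noisy"


def infer_context(samples: Sequence[float],
                  frame_len: int,
                  energy_threshold: float,
                  classifier: Callable[[List[Frame]], str]) -> Optional[str]:
    """Run both steps; return None when no non-speech frames are found."""
    non_speech = select_non_speech(split_frames(samples, frame_len),
                                   energy_threshold)
    if not non_speech:
        return None
    return classifier(non_speech)


# Synthetic input: a low-level background, a loud burst standing in for
# speech, then background again.
signal = [0.01] * 160 + [0.9] * 160 + [0.02] * 160
print(infer_context(signal, 160, 0.1, toy_classifier))
```

Keeping the classifier as a plain callable means the placeholder can be swapped for a full or compressed model without changing the segmentation step.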
