From Eliza to XiaoIce: challenges and opportunities with social chatbots

Conversational systems have come a long way since their inception in the 1960s. After decades of research and development, the field has progressed from Eliza and Parry in the 1960s and 1970s, to task-completion systems such as those in the Defense Advanced Research Projects Agency (DARPA) Communicator program in the 2000s, to intelligent personal assistants such as Siri in the 2010s, to today’s social chatbots such as XiaoIce. The appeal of social chatbots lies not only in their ability to respond to users’ diverse requests, but also in their ability to establish an emotional connection with users. They establish this connection by satisfying users’ needs for communication, affection, and social belonging. To further the advancement and adoption of social chatbots, their design must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with a social chatbot; accordingly, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies in building social chatbots, from core chat to visual awareness to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever to live with artificial intelligence (AI), we have a responsibility to design social chatbots to be both useful and empathetic, so they will become ubiquitous and help society as a whole.
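As a rough illustration of the CPS metric named in the abstract, the sketch below averages the number of conversation turns over a set of sessions. The abstract does not spell out how a turn is counted, so the convention here (one turn is a user message immediately followed by a chatbot reply), the `(speaker, text)` log format, and the function names `count_turns` and `conversation_turns_per_session` are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical log format: a session is an ordered list of (speaker, text) pairs,
# where speaker is either "user" or "bot".
Session = List[Tuple[str, str]]


def count_turns(session: Session) -> int:
    """Count turns in one session.

    Assumption: a turn is a user message immediately followed by a bot reply.
    """
    turns = 0
    previous_speaker = None
    for speaker, _text in session:
        if speaker == "bot" and previous_speaker == "user":
            turns += 1
        previous_speaker = speaker
    return turns


def conversation_turns_per_session(sessions: List[Session]) -> float:
    """Average number of turns per session (CPS); 0.0 when there are no sessions."""
    if not sessions:
        return 0.0
    return sum(count_turns(s) for s in sessions) / len(sessions)


example_sessions = [
    [("user", "hi"), ("bot", "hello"), ("user", "how are you?"), ("bot", "doing well")],
    [("user", "hey"), ("bot", "hi there")],
]
cps = conversation_turns_per_session(example_sessions)  # sessions have 2 and 1 turns
```

Under this counting convention, a higher CPS means users stay in conversation longer, which is the engagement signal the abstract argues social chatbots should be designed around.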
