Optimizing Interactive Systems via Data-Driven Objectives

Effective optimization is essential if real-world interactive systems are to deliver a satisfactory user experience as user behavior changes. However, it is often hard to identify an objective to optimize for such systems (e.g., for policy learning in task-oriented dialog systems). Typically, objectives are crafted manually and rarely capture complex user needs accurately. We propose an approach that infers the objective directly from observed user interactions; these inferences require no prior knowledge and hold across different types of user behavior. We introduce the Interactive System Optimizer (ISO), a novel algorithm that uses the inferred objectives for optimization. Our main contribution is a general, principled approach to optimizing interactive systems with data-driven objectives. We demonstrate the effectiveness of ISO in several simulation experiments.
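The abstract does not spell out how the objective is inferred or how ISO uses it, but the general idea can be illustrated with a minimal sketch: learn a reward-like objective from logged user interactions, then optimize the system's policy against it. The sketch below assumes a toy tabular system, a linear reward model fit contrastively against random behavior, and a REINFORCE-style optimizer; `infer_reward` and `optimize_policy` are hypothetical illustrations, not the actual ISO algorithm.

```python
# Minimal sketch, not the paper's method: infer a data-driven objective from
# logged interactions, then optimize a policy against it.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 3, 8

def featurize(state, action):
    """One-hot feature vector for a (state, action) pair."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[state * N_ACTIONS + action] = 1.0
    return phi

def infer_reward(user_trajectories, n_iters=100, lr=0.1):
    """Fit a linear reward so observed user behavior scores higher than
    random behavior (a simple contrastive stand-in for objective inference)."""
    w = np.zeros(N_STATES * N_ACTIONS)
    observed = np.array([featurize(s, a) for traj in user_trajectories for s, a in traj])
    for _ in range(n_iters):
        random_sa = np.array([featurize(rng.integers(N_STATES), rng.integers(N_ACTIONS))
                              for _ in range(len(observed))])
        # Push features of observed interactions up, random ones down.
        w += lr * (observed.mean(axis=0) - random_sa.mean(axis=0))
    return w

def optimize_policy(reward_w, n_episodes=500, lr=0.05):
    """REINFORCE against the inferred reward; returns softmax policy logits."""
    logits = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_episodes):
        state = rng.integers(N_STATES)
        grads, rewards = [], []
        for _ in range(HORIZON):
            z = logits[state] - logits[state].max()   # numerically stable softmax
            probs = np.exp(z) / np.exp(z).sum()
            action = rng.choice(N_ACTIONS, p=probs)
            grad = -probs
            grad[action] += 1.0                        # d log pi(a|s) / d logits
            grads.append((state, grad))
            rewards.append(reward_w @ featurize(state, action))
            state = rng.integers(N_STATES)             # toy transition dynamics
        ret = sum(rewards)
        for s, grad in grads:
            logits[s] += lr * ret * grad
    return logits

# Toy "observed user interactions": users tend to choose action 0 in every state.
user_trajs = [[(rng.integers(N_STATES), 0) for _ in range(HORIZON)] for _ in range(50)]
reward_w = infer_reward(user_trajs)
policy_logits = optimize_policy(reward_w)
print("Learned policy prefers action", policy_logits.argmax(axis=1))
```

Under these toy assumptions, the inferred reward favors the behavior users actually exhibit, and the optimized policy reproduces it; the real setting would replace both the contrastive reward fit and the REINFORCE loop with the paper's own components.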
