Multi-Task Learning with Neural Networks for Voice Query Understanding on an Entertainment Platform

We tackle the challenge of understanding voice queries posed against the Comcast Xfinity X1 entertainment platform, where consumers direct speech input at their "voice remotes". Such queries range from specific program navigation (e.g., watch a movie) to requests with vague intents and even queries that have nothing to do with watching TV. We present successively richer neural network architectures to tackle this challenge based on two key insights. The first is that session context can be exploited to disambiguate queries and recover from ASR errors, which we operationalize with hierarchical recurrent neural networks. The second is that query understanding requires evidence integration across multiple related tasks, which we identify as program prediction, intent classification, and query tagging. We present a novel multi-task neural architecture that jointly learns to accomplish all three tasks. Our initial model, already deployed in production, serves millions of queries daily with an improved customer experience. The novel multi-task learning model, first described here, is evaluated through carefully controlled laboratory experiments, which demonstrate further gains in effectiveness and increased system capabilities.
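To make the idea of jointly learning the three tasks concrete, the following is a minimal sketch of a joint multi-task objective, not the architecture from the paper. It assumes a shared encoder feeding three task heads whose cross-entropy losses are summed with equal weight. The class name `MultiTaskSketch`, the averaged-embedding encoder (a stand-in for the recurrent networks the abstract mentions), the equal weighting, and all sizes are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label index."""
    return -math.log(probs[gold])


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def init(rows, cols, rng):
    """Small random weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultiTaskSketch:
    """Hypothetical shared encoder with three task-specific heads."""

    def __init__(self, vocab_size, dim, n_programs, n_intents, n_tags, seed=0):
        rng = random.Random(seed)
        self.embed = init(vocab_size, dim, rng)         # shared across tasks
        self.program_head = init(n_programs, dim, rng)  # program prediction
        self.intent_head = init(n_intents, dim, rng)    # intent classification
        self.tag_head = init(n_tags, dim, rng)          # per-token query tagging

    def encode(self, token_ids):
        # Assumption: averaged embeddings stand in for a recurrent encoder.
        tokens = [self.embed[t] for t in token_ids]
        dim = len(tokens[0])
        pooled = [sum(v[i] for v in tokens) / len(tokens) for i in range(dim)]
        return tokens, pooled

    def joint_loss(self, token_ids, program, intent, tags):
        tokens, pooled = self.encode(token_ids)
        loss_program = cross_entropy(
            softmax(linear(self.program_head, pooled)), program)
        loss_intent = cross_entropy(
            softmax(linear(self.intent_head, pooled)), intent)
        # Tagging loss is averaged over the tokens of the query.
        loss_tags = sum(
            cross_entropy(softmax(linear(self.tag_head, v)), t)
            for v, t in zip(tokens, tags)
        ) / len(tags)
        # Assumption: equal task weights; a real system might tune these.
        return loss_program + loss_intent + loss_tags


model = MultiTaskSketch(vocab_size=10, dim=4, n_programs=5, n_intents=3, n_tags=4)
loss = model.joint_loss(token_ids=[1, 2, 3], program=2, intent=1, tags=[0, 1, 1])
print(f"joint loss: {loss:.3f}")
```

Because all three losses depend on the same embedding table, a gradient step on the joint loss would update the shared representation with evidence from every task, which is the general intuition behind multi-task learning.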
