ChatCache: A Hierarchical Semantic Redundancy Cache System for Conversational Services at Edge

Spatial-temporal locality has been observed in various conversational-service scenarios with either voice or text requests. Given the current cloud-based processing mechanism, integrating such services with caching is a promising way to improve responsiveness, reduce in-network transmission, and avoid computational redundancy. Going beyond precise redundancy and fuzzy redundancy, semantic redundancy adapts to the diversity of command expression and is considered a practical solution for conversational services. In this paper, we introduce a hierarchical cache design inspired by semantic redundancy for conversational services. We propose ChatCache, a scalable edge system that incorporates the hierarchical cache design and serves single or multiple users. We discuss cache efficiency under different similarity-match policies, and evaluate the responsiveness and scalability of ChatCache on heterogeneous edge platforms. On most of the evaluated platforms, ChatCache reduces user-perceived latency by more than 91.7% for voice requests and more than 81.6% for text requests. The throughput of ChatCache reaches 42.6 transactions per second (tps) for voice requests and 64.4 tps for text requests, which is comparable with mainstream cloud cognitive services. These promising results demonstrate the capability of ChatCache to reduce user-perceived latency and computational redundancy while maintaining high response accuracy for conversational services.
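The abstract refers to similarity-match policies that decide whether a new request is semantically redundant with a cached one. The minimal Python sketch below illustrates that general idea: request embeddings are compared by cosine similarity, and a cached response is reused when the similarity exceeds a threshold, otherwise the cloud backend is queried. The `SemanticCache` class, the `embed` stub, and the 0.9 threshold are illustrative assumptions, not the ChatCache implementation.

```python
# Minimal sketch of a semantic-redundancy cache lookup (illustrative only).
import numpy as np


class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold            # similarity required for a cache hit (assumed value)
        self.keys: list[np.ndarray] = []      # embeddings of previously seen requests
        self.values: list[str] = []           # responses cached for those requests

    def lookup(self, query_vec: np.ndarray):
        """Return the cached response most similar to the query if its
        cosine similarity exceeds the threshold; otherwise None (a miss)."""
        if not self.keys:
            return None
        mat = np.stack(self.keys)
        sims = mat @ query_vec / (
            np.linalg.norm(mat, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def insert(self, query_vec: np.ndarray, response: str):
        self.keys.append(query_vec)
        self.values.append(response)


def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a sentence encoder (e.g., a distilled BERT);
    a real system would produce semantically meaningful vectors here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)


cache = SemanticCache(threshold=0.9)
vec = embed("turn on the living room light")
resp = cache.lookup(vec)
if resp is None:
    # Cache miss: forward the request to the cloud conversational service.
    resp = "<response from cloud conversational service>"
    cache.insert(vec, resp)
```

In a hierarchical design of the kind the paper describes, such a lookup would typically be attempted at a cheap matching level first (e.g., exact or lexical match) before falling back to embedding-based semantic matching and, finally, the cloud backend.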
