Evaluating Voice Interaction Pipelines at the Edge

With the recent releases of Alexa Voice Services and Google Home, voice-driven interactive computing is quickly becoming commonplace. Voice-interactive applications incorporate multiple components, including complex speech recognition and translation algorithms, natural language understanding and generation capabilities, and custom compute functions commonly referred to as skills. Voice-driven interactive systems are built as software pipelines composed of these components. These pipelines are typically resource intensive and must execute quickly to maintain latencies consistent with natural dialogue. Consequently, voice interaction pipelines are usually computed entirely in the cloud. In many cases, however, cloud connectivity is not practical, and the pipelines must instead be executed at the edge. In this paper, we evaluate the impact of pushing voice-driven pipelines to computationally weak edge devices. Our primary motivation is to enable voice-driven interfaces for first responders during emergencies, such as building fires, when connectivity to the cloud is impractical. We first characterize the end-to-end performance of a complete open source voice interaction pipeline for four different configurations ranging from entirely cloud-based to completely edge-based. We then identify potential optimization opportunities that would enable voice-driven interaction pipelines to be executed fully on computationally weak edge devices at lower response latencies than high-performance cloud services.