Lattice-Based Improvements for Voice Triggering Using Graph Neural Networks

Voice-triggered smart assistants often rely on detecting a trigger phrase before they start listening to the user's request. Mitigating false triggers is an important aspect of building a privacy-centric, non-intrusive smart assistant. In this paper, we address the task of false-trigger mitigation (FTM) with a novel approach that analyzes automatic speech recognition (ASR) lattices using graph neural networks (GNNs). The approach exploits the fact that the decoding lattice of falsely triggered audio exhibits uncertainty, in the form of many alternative paths and unexpected words on the lattice arcs, compared to the lattice of correctly triggered audio. A pure trigger-phrase detector does not fully utilize the intent of the user's speech, whereas by using the complete decoding lattice of the user's audio we can effectively mitigate speech not intended for the smart assistant. We deploy two GNN variants, based on 1) graph convolution layers and 2) a self-attention mechanism. Our experiments demonstrate that GNNs are highly accurate on the FTM task, mitigating ~87% of false triggers at a 99% true positive rate (TPR). Furthermore, the proposed models are fast to train and parameter-efficient.
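To make the graph-convolution variant concrete, the following is a minimal sketch of one Kipf & Welling-style graph convolution layer applied to a toy lattice graph. The 4-node "lattice", feature dimensions, and symmetrized adjacency are all illustrative assumptions, not the paper's actual architecture or features:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)   # ReLU activation

# Toy "lattice": 4 word-arc nodes forming a small DAG
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
A = A + A.T  # symmetrize so messages pass both ways along lattice arcs

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))   # hypothetical per-arc features (e.g. word embedding + scores)
W = rng.normal(size=(3, 8))   # learnable layer weights
H1 = gcn_layer(A, H, W)
print(H1.shape)  # (4, 8): one 8-dim hidden vector per lattice node
```

Stacking a few such layers and pooling the node representations into a single vector would yield the per-utterance score used for the true/false-trigger decision; the self-attention variant replaces the fixed normalized adjacency with learned attention weights over neighboring nodes.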
