Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation

False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is the process of detecting false trigger events so that the assistant can respond appropriately to the user. In this paper, we propose a novel solution to the FTM problem by introducing a parallel ASR decoding process with a special language model trained from "out-of-domain" data sources. This language model complements the existing language model, which is optimized for the assistant task. A bidirectional lattice RNN (Bi-LRNN) classifier trained on the lattices generated by the complementary language model achieves a $38.34\%$ relative reduction in the false trigger (FT) rate at a fixed false suppression (FS) rate of $0.4\%$ on correct invocations, compared to the current Bi-LRNN model. In addition, we propose training a parallel Bi-LRNN model on the decoding lattices from both language models and examine several implementation variants. The resulting model yields a further $10.8\%$ relative reduction in the false trigger rate.
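To make the parallel Bi-LRNN idea concrete, below is a minimal PyTorch sketch of one plausible realization: two lattice encoders, one per language model's decoding lattice, whose pooled embeddings are concatenated and fed to a shared false-trigger classifier. All class names, the GRU-over-topologically-sorted-arcs simplification, and the dimensions are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of a parallel Bi-LRNN false-trigger classifier.
# A full Bi-LRNN propagates states along lattice arcs; here each lattice
# is approximated by a bidirectional GRU over its arcs in topological
# order, followed by mean pooling. Names and shapes are assumptions.
import torch
import torch.nn as nn


class LatticeEncoder(nn.Module):
    """Encode one decoding lattice into a fixed-size embedding."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim,
                          bidirectional=True, batch_first=True)

    def forward(self, arc_feats: torch.Tensor) -> torch.Tensor:
        # arc_feats: (batch, num_arcs, feat_dim), arcs in topological order.
        states, _ = self.rnn(arc_feats)   # (batch, num_arcs, 2 * hidden_dim)
        return states.mean(dim=1)         # pooled lattice embedding


class ParallelBiLRNN(nn.Module):
    """Classify true vs. false trigger from the lattices produced by the
    assistant-domain LM and the complementary "out-of-domain" LM."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.enc_main = LatticeEncoder(feat_dim, hidden_dim)  # assistant LM
        self.enc_comp = LatticeEncoder(feat_dim, hidden_dim)  # complementary LM
        self.classifier = nn.Linear(4 * hidden_dim, 2)        # FT vs. correct

    def forward(self, main_arcs: torch.Tensor,
                comp_arcs: torch.Tensor) -> torch.Tensor:
        h = torch.cat([self.enc_main(main_arcs),
                       self.enc_comp(comp_arcs)], dim=-1)
        return self.classifier(h)          # logits: (batch, 2)


# Example: a batch of 8 utterances, 50 arcs per lattice, 40-dim arc features.
model = ParallelBiLRNN(feat_dim=40)
logits = model(torch.randn(8, 50, 40), torch.randn(8, 50, 40))
```

One design choice worth noting: keeping a separate encoder per lattice (rather than sharing weights) lets each branch specialize to its language model's lattice statistics, mirroring the paper's use of two distinct decoding processes.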
