Device-directed Utterance Detection

In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include the rejection of false wake-ups or unintended interactions, as well as enabling wake-word-free follow-up queries. Consider the example interaction: "Computer, play music", "Computer, reduce the volume". Here, the user has to repeat the wake-word ("Computer") for the second query. To allow for more natural interactions, the device could immediately re-enter the listening state after the first query (without requiring the wake-word to be repeated) and accept or reject a potential follow-up as device-directed or background speech. The proposed model consists of two long short-term memory (LSTM) neural networks, trained on acoustic features and on automatic speech recognition (ASR) 1-best hypotheses, respectively. A feed-forward deep neural network (DNN) is then trained to combine the acoustic and 1-best embeddings derived from the LSTMs with features from the ASR decoder. Experimental results show that the ASR decoder features, the acoustic embeddings, and the 1-best embeddings yield equal-error rates (EERs) of 9.3%, 10.9%, and 20.1%, respectively. Combining all features results in a 44% relative improvement and a final EER of 5.2%.
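The reported figures are equal-error rates: the operating point at which the rate of falsely accepted background speech equals the rate of falsely rejected device-directed speech. As a minimal sketch of how an EER can be read off a set of classifier scores (the function name, the threshold sweep over observed scores, and the toy data are illustrative assumptions, not taken from the paper):

```python
def equal_error_rate(scores, labels):
    """Equal-error rate for binary detection scores.

    labels: 1 = device-directed, 0 = background speech.
    Sweeps the decision threshold over all observed scores and returns
    the operating point where the false-accept rate (background accepted)
    and the false-reject rate (directed speech rejected) are closest.
    """
    n_pos = labels.count(1)
    n_neg = labels.count(0)
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        # accept an utterance as device-directed if its score >= t
        fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t)
        fr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t)
        far, frr = fa / n_neg, fr / n_pos
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# perfectly separated scores give an EER of 0
print(equal_error_rate([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0, 0]))
```

In practice the EER is usually interpolated from a finer ROC/DET curve; the coarse sweep above just takes the threshold with the smallest FAR-FRR gap and averages the two rates there.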
