Device-directed Utterance Detection

In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include the rejection of false wake-ups or unintended interactions, as well as enabling wake-word-free follow-up queries. Consider the example interaction: "Computer, play music", "Computer, reduce the volume". Here, the user has to repeat the wake-word ("Computer") for the second query. To allow for more natural interactions, the device could immediately re-enter the listening state after the first query (without requiring the wake-word to be repeated) and accept or reject a potential follow-up as device-directed or background speech. The proposed model consists of two long short-term memory (LSTM) neural networks, trained on acoustic features and on automatic speech recognition (ASR) 1-best hypotheses, respectively. A feed-forward deep neural network (DNN) is then trained to combine the acoustic and 1-best embeddings derived from the LSTMs with features from the ASR decoder. Experimental results show that the ASR decoder features, the acoustic embeddings, and the 1-best embeddings yield equal-error rates (EERs) of 9.3%, 10.9%, and 20.1%, respectively. Combining all features results in a 44% relative improvement and a final EER of 5.2%.
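The reported figures are equal-error rates: the operating point at which the rate of falsely accepted background speech equals the rate of falsely rejected device-directed speech. As a minimal sketch of how an EER can be read off a set of classifier scores (the function name, the threshold sweep over observed scores, and the toy data are illustrative assumptions, not taken from the paper):

```python
def equal_error_rate(scores, labels):
    """Equal-error rate for binary detection scores.

    labels: 1 = device-directed, 0 = background speech.
    Sweeps the decision threshold over all observed scores and returns
    the operating point where the false-accept rate (background accepted)
    and the false-reject rate (directed speech rejected) are closest.
    """
    n_pos = labels.count(1)
    n_neg = labels.count(0)
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)):
        # accept an utterance as device-directed if its score >= t
        fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t)
        fr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t)
        far, frr = fa / n_neg, fr / n_pos
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# perfectly separated scores give an EER of 0
print(equal_error_rate([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 1, 0, 0, 0]))
```

In practice the EER is usually interpolated from a finer ROC/DET curve; the coarse sweep above just takes the threshold with the smallest FAR-FRR gap and averages the two rates there.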
