A Study for Improving Device-Directed Speech Detection Toward Frictionless Human-Machine Interaction

In this paper, we extend our previous work on device-directed utterance detection, which aims to distinguish voice queries intended for a smart-home device from background speech. We frame the task as a binary utterance-level classification problem and approach it with a DNN-LSTM model that takes acoustic features and features from the automatic speech recognition (ASR) decoder as input. In this work, we study the performance of the model for different dialog types and for different categories of decoder features. To address different dialog types, we found that a model with a separate output branch for each dialog type outperforms a model with a shared output branch, reducing equal error rate (EER) by 12.5% relative. We also found the average number of arcs in a confusion network to be one of the most informative ASR decoder features. In addition, we explore different backpropagation frequencies for training the acoustic embedding, updating every k frames (k = 1, 3, 5, 7), and we compare mean pooling and attention pooling for generating an utterance-level representation. We found that attention pooling provides the most discriminative utterance representation, outperforming mean pooling by 4.97% relative EER reduction.
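To make the pooling comparison and the per-dialog-type branching concrete, the following is a minimal PyTorch sketch, not the paper's implementation: the class name DirectednessClassifier, the feature dimension (40), hidden size (128), and number of dialog types (3) are all illustrative assumptions, and the ASR decoder features are omitted for brevity.

```python
# Minimal sketch (assumptions, not the authors' code): an LSTM acoustic
# encoder with mean vs. learned attention pooling over frames, and a
# separate binary output branch for each dialog type.
import torch
import torch.nn as nn


class DirectednessClassifier(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=128, num_dialog_types=3,
                 pooling="attention"):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.pooling = pooling
        if pooling == "attention":
            # One scalar score per frame; softmax over time yields weights.
            self.attn = nn.Linear(hidden_dim, 1)
        # One output branch per dialog type, rather than a shared branch.
        self.branches = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_dialog_types)
        )

    def forward(self, frames, dialog_type):
        # frames: (batch, time, feat_dim); dialog_type: (batch,) long tensor
        h, _ = self.lstm(frames)                    # (batch, time, hidden)
        if self.pooling == "attention":
            w = torch.softmax(self.attn(h), dim=1)  # (batch, time, 1)
            utt = (w * h).sum(dim=1)                # attention-weighted sum
        else:
            utt = h.mean(dim=1)                     # mean pooling over time
        # Route each utterance through the branch for its dialog type.
        logits = torch.stack(
            [self.branches[t](utt[i])
             for i, t in enumerate(dialog_type.tolist())]
        ).squeeze(-1)
        return logits                               # device-directed logit


model = DirectednessClassifier()
x = torch.randn(2, 100, 40)      # 2 utterances, 100 frames, 40-dim features
types = torch.tensor([0, 2])     # hypothetical dialog type index per utterance
print(model(x, types).shape)     # torch.Size([2])
```

The attention branch replaces the uniform 1/T frame weights of mean pooling with learned weights, which is one plausible reading of why it yields a more discriminative utterance representation; the true architecture and feature set are described in the paper itself.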