Speech Source Discrimination Method for Environments with Multiple Voice User Interfaces

Voice user interfaces (VUIs), which enable users to control devices such as smart speakers and smartphones by voice, are becoming increasingly popular. However, in an environment with multiple VUI devices nearby, a device may mis-respond to playback audio from other devices, such as another VUI device's response voice, narration voices, or hands-free telephone voices. In this study, we therefore propose a speech source discrimination method for multi-VUI environments using a convolutional neural network (CNN). The method uses mel-frequency cepstral coefficient (MFCC) features, which are widely used in speech signal processing, as the input to the CNN. The experimental results confirmed that the proposed method discriminates among the speech sources, including direct voice, playback voice, and synthesized voice, with 97.5% accuracy when the speakers and sentences are the same as in the training data. In addition, we improved the discrimination accuracy for speakers and sentences not seen in training by splitting the speech data.
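The abstract names MFCCs as the CNN input features. As an illustration only (the paper's exact frame size, hop length, and coefficient count are not given here, so the parameters below are assumptions), the standard MFCC pipeline — framing, windowing, power spectrum, mel filterbank, log, and DCT — can be sketched in plain numpy:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Compute an (n_frames, n_ceps) MFCC matrix.
    Parameter values are illustrative defaults, not the paper's settings."""
    # 1) Frame the signal and apply a Hamming window.
    window = np.hamming(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2) Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Triangular mel filterbank spanning 0 .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    # 4) Log mel energies (small epsilon avoids log(0)).
    logmel = np.log(power @ fbank.T + 1e-10)
    # 5) DCT-II decorrelates the log energies into cepstral coefficients.
    n = logmel.shape[1]
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * np.arange(n) + 1) / (2.0 * n)))
    return logmel @ dct.T

# Usage: one second of a 440 Hz tone at 16 kHz -> (97, 13) feature matrix,
# which could be fed to a CNN as a single-channel "image".
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = mfcc(tone)
```

Stacking such per-frame coefficient vectors into a time-by-coefficient matrix is what makes a 2-D CNN a natural classifier for the resulting features.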