Non-native acoustic modeling for mispronunciation verification based on language adversarial representation learning

Non-native mispronunciation verification provides feedback that guides language learners in correcting their pronunciation errors, and it plays an important role in computer-aided pronunciation training (CAPT) systems. Most existing approaches build the acoustic model directly on a non-native corpus and therefore suffer from data sparsity, because collecting and annotating non-native speech is time-consuming. To address this problem, we propose a pre-training approach that uses speech data from two native languages (the learner's native language and the target language) for non-native mispronunciation verification. We first train an unsupervised model to extract knowledge from a large amount of unlabeled raw speech in the target language by predicting future observations in the speech signal. The model is then trained adversarially with speech from the learner's native language, aligning the feature distributions of the two languages by confusing a language discriminator. In addition, sinc filters are incorporated at the first convolutional layer to capture formant-like features. Formants are related to the place and manner of articulation, so they are useful not only for detecting pronunciation errors but also for providing instructive feedback. The pre-trained model then serves as the feature extractor in the downstream mispronunciation verification task.
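The language adversarial step can be written compactly as a two-player objective. The formulation below is the generic domain-adversarial form and is only a sketch: the symbols (encoder parameters $\theta_f$, discriminator parameters $\theta_d$, future-prediction loss $\mathcal{L}_{\mathrm{pred}}$, language-classification loss $\mathcal{L}_{\mathrm{lang}}$, trade-off weight $\lambda$) are assumptions introduced here for illustration, and the exact losses used in this work may differ.

```latex
\[
\begin{aligned}
\hat{\theta}_d &= \arg\min_{\theta_d} \; \mathcal{L}_{\mathrm{lang}}(\theta_f, \theta_d), \\
\hat{\theta}_f &= \arg\min_{\theta_f} \; \mathcal{L}_{\mathrm{pred}}(\theta_f) \;-\; \lambda \, \mathcal{L}_{\mathrm{lang}}(\theta_f, \theta_d), \qquad \lambda > 0 .
\end{aligned}
\]
```

Under this reading, the discriminator tries to tell the two languages apart, while the encoder is rewarded when the discriminator fails, which pushes the feature distributions of the two languages toward each other.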
Experiments on the Japanese part of the BLCU inter-Chinese speech corpus show that, for non-native phone recognition and mispronunciation verification, (1) the knowledge learned from the speech of the two native languages with the proposed unsupervised approach is useful for both tasks; (2) the proposed language adversarial representation learning effectively improves performance; and (3) formant-like features can be incorporated by introducing sinc filters, further improving mispronunciation verification performance.
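To make the sinc filter idea concrete, the following is a minimal sketch of one band-pass kernel built as the difference of two windowed low-pass sinc functions, which is how sinc-based first layers are commonly parameterized. It is not the implementation used in this work: the function names, the 16 kHz sample rate, the kernel size, and the Hamming window are all assumptions chosen for illustration. In a trainable layer, only the two cutoff frequencies of each kernel would be learned.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f_low_hz, f_high_hz, sample_rate=16000, kernel_size=101):
    """Return FIR taps of a Hamming-windowed band-pass sinc filter.

    All parameter names and defaults are hypothetical choices for this sketch.
    """
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be an odd integer >= 3")
    if not 0 <= f_low_hz < f_high_hz <= sample_rate / 2:
        raise ValueError("need 0 <= f_low_hz < f_high_hz <= sample_rate / 2")

    f1 = f_low_hz / sample_rate   # normalized low cutoff (cycles per sample)
    f2 = f_high_hz / sample_rate  # normalized high cutoff (cycles per sample)
    half = kernel_size // 2

    taps = []
    for i in range(kernel_size):
        n = i - half  # time index centered on zero
        # Ideal band-pass = low-pass at f2 minus low-pass at f1.
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window to reduce ripple caused by truncating the sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate=16000):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(t * math.cos(w * k) for k, t in enumerate(taps))
    im = -sum(t * math.sin(w * k) for k, t in enumerate(taps))
    return math.hypot(re, im)


if __name__ == "__main__":
    # A band roughly covering a first-formant region (illustrative values).
    taps = sinc_bandpass(500, 1500)
    for f in (0, 1000, 4000):
        print(f, "Hz ->", round(gain_at(taps, f), 3))
```

Because each kernel is fully determined by two cutoff frequencies, the learned filters are directly interpretable as frequency bands, which is what makes them a natural fit for capturing formant-like features.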