Short Utterance Based Speech Language Identification in Intelligent Vehicles With Time-Scale Modifications and Deep Bottleneck Features

Conversations in intelligent vehicles usually consist of short utterances. Because these utterances are brief (e.g., less than 3 s), it is difficult to learn enough information to distinguish between languages. In this paper, we propose an end-to-end speech language identification (SLI) approach that is especially suited to short utterances. The approach is implemented with a long short-term memory (LSTM) neural network designed for SLI in intelligent vehicles. The features used for LSTM training are generated by a transfer learning method: they are the bottleneck features of a deep neural network trained as a Mandarin acoustic-phonetic classifier. To improve SLI accuracy on short utterances, a phase vocoder based time-scale modification method is used to reduce and increase the speech rate of the test utterance. By concatenating the normal, rate-reduced, and rate-increased utterances, we extend the length of the test utterance and thereby improve the performance of the SLI system. Experimental results on the AP17-OLR database demonstrate that the proposed method improves SLI performance, especially on short utterances. The proposed SLI approach also performs robustly in noisy vehicular environments.
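The utterance-extension step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a basic textbook phase vocoder written with NumPy, and the function names (`phase_vocoder_stretch`, `extend_utterance`), the frame size, the hop size, and the rates 0.8 and 1.2 are all hypothetical choices made for illustration. The abstract does not state which rates or analysis parameters were used.

```python
import numpy as np

def phase_vocoder_stretch(x, rate, n_fft=512, hop=128):
    """Time-scale x without changing pitch (basic phase vocoder sketch).

    rate < 1 slows the speech down (longer output);
    rate > 1 speeds it up (shorter output).
    n_fft and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Analysis: windowed short-time Fourier transform of the input.
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)

    # Fractional analysis-frame positions read for each synthesis frame.
    steps = np.arange(0, n_frames - 1, rate)
    # Expected phase advance per hop for each frequency bin.
    omega = 2 * np.pi * hop * np.arange(spec.shape[1]) / n_fft
    phase = np.angle(spec[0])

    out = np.zeros((len(steps) - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for k, t in enumerate(steps):
        i = int(np.floor(t))
        frac = t - i
        # Interpolate magnitudes between neighbouring analysis frames.
        mag = (1 - frac) * np.abs(spec[i]) + frac * np.abs(spec[i + 1])
        frame = np.fft.irfft(mag * np.exp(1j * phase), n=n_fft) * window
        out[k * hop:k * hop + n_fft] += frame
        norm[k * hop:k * hop + n_fft] += window ** 2
        # Accumulate phase: expected advance plus wrapped deviation.
        dphi = np.angle(spec[i + 1]) - np.angle(spec[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
    # Normalize the overlap-add by the summed squared window.
    nz = norm > 1e-8
    out[nz] /= norm[nz]
    return out

def extend_utterance(x, slow_rate=0.8, fast_rate=1.2):
    """Concatenate normal, rate-reduced, and rate-increased versions.

    The rate values are assumptions for illustration only.
    """
    slow = phase_vocoder_stretch(x, slow_rate)
    fast = phase_vocoder_stretch(x, fast_rate)
    return np.concatenate([x, slow, fast])
```

With these assumed rates, a 1 s test utterance becomes roughly 3 s of input for the classifier (about 1 + 1/0.8 + 1/1.2 times the original duration). The linguistic content is unchanged, so the sketch shows only how the extended input is built, not why it helps.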
