Neural network based feature transformation for emotion independent speaker identification

This paper proposes a neural network based feature transformation framework for developing an emotion independent speaker identification system. Most present speaker recognition systems may not perform well in emotional environments. In real life, humans express emotions extensively during conversation to convey their messages effectively. Therefore, in this work we propose a speaker recognition system that is robust to variations in the emotional moods of speakers. Neural network models are explored to transform speaker-specific spectral features from any specific emotion to neutral. We consider eight emotions, namely Anger, Sad, Disgust, Fear, Happy, Neutral, Sarcastic and Surprise. Emotional databases developed in Hindi, Telugu and German are used to analyze the effect of the proposed feature transformation on the performance of the speaker identification system. Spectral features are represented by mel-frequency cepstral coefficients, and speaker models are developed using Gaussian mixture models. The performance of the speaker identification system is analyzed with various feature mapping techniques. The results demonstrate that the proposed neural network based feature transformation improves speaker identification performance by 20%. Feature transformation at the syllable level performs better than transformation at the sentence level.
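
The core idea, mapping spectral features of emotional speech toward their neutral counterparts before speaker modelling, can be illustrated with a minimal sketch. The abstract does not give the network architecture, the training procedure, or how frames are aligned, so everything below is an assumption made for illustration: random 13-dimensional vectors stand in for time-aligned MFCC frames, the "emotional" distortion is a made-up scaling and offset, and `train_mapping` is a hypothetical helper that fits a one-hidden-layer network by plain gradient descent. It is not the authors' implementation, and the Gaussian mixture speaker models are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for time-aligned 13-dimensional MFCC frames (assumption).
neutral = rng.normal(size=(500, 13))
# Hypothetical "emotional" distortion: scaling, offset, and a little noise.
emotional = 1.3 * neutral + 0.5 + 0.05 * rng.normal(size=neutral.shape)


def train_mapping(src, tgt, hidden=32, epochs=1000, lr=0.05, seed=0):
    """Fit a one-hidden-layer tanh network that maps src frames to tgt frames."""
    init = np.random.default_rng(seed)
    # Standardise the inputs so gradient descent behaves predictably.
    mu, sigma = src.mean(axis=0), src.std(axis=0)
    x = (src - mu) / sigma
    w1 = init.normal(0.0, 0.1, (x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = init.normal(0.0, 0.1, (hidden, tgt.shape[1]))
    b2 = np.zeros(tgt.shape[1])
    for _ in range(epochs):
        h = np.tanh(x @ w1 + b1)                # hidden activations
        err = (h @ w2 + b2 - tgt) / len(x)      # gradient of the squared error
        dh = (err @ w2.T) * (1.0 - h ** 2)      # backpropagate through tanh
        w2 -= lr * (h.T @ err)
        b2 -= lr * err.sum(axis=0)
        w1 -= lr * (x.T @ dh)
        b1 -= lr * dh.sum(axis=0)

    def transform(frames):
        z = (frames - mu) / sigma
        return np.tanh(z @ w1 + b1) @ w2 + b2

    return transform


# Train on the first 400 frame pairs, evaluate on the held-out 100.
to_neutral = train_mapping(emotional[:400], neutral[:400])
mapped = to_neutral(emotional[400:])

mse_before = float(np.mean((emotional[400:] - neutral[400:]) ** 2))
mse_after = float(np.mean((mapped - neutral[400:]) ** 2))
print(f"MSE to neutral before mapping: {mse_before:.3f}")
print(f"MSE to neutral after mapping:  {mse_after:.3f}")
```

In a full system along the lines the abstract describes, the transformed frames rather than the raw emotional ones would be scored against the speaker models trained on neutral speech; the sketch only checks that the mapped frames lie closer to the neutral targets.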
