The exploitation of Multiple Feature Extraction Techniques for Speaker Identification in Emotional States under Disguised Voices

Owing to advances in artificial intelligence, speaker identification (SI) technologies have improved markedly and are now widely used across a variety of sectors. Feature extraction is one of the most important components of SI, with a substantial impact on both the identification process and its performance. Accordingly, numerous feature extraction strategies have been thoroughly investigated, compared, and analyzed. This article exploits five distinct feature extraction methods for speaker identification from disguised voices in emotional environments. To evaluate this work rigorously, three disguise effects are used: high-pitched, low-pitched, and Electronic Voice Conversion (EVC). Experimental results show that the concatenation of Mel-Frequency Cepstral Coefficients (MFCCs), MFCC deltas (MFCCs-delta), and MFCC delta-deltas (MFCCs-delta-delta) is the best-performing feature extraction method.

Keywords— Convolutional neural network, disguised voices, emotional environments, speaker identification, support vector machine, mel-frequency cepstral coefficients.
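The winning feature set concatenates the static MFCCs with their first- and second-order time derivatives (deltas and delta-deltas), giving the classifier both spectral shape and its dynamics. A minimal sketch of that concatenation is shown below, assuming the MFCC matrix has already been computed (here a random placeholder stands in for real coefficients); the delta computation uses the standard regression formula, not any specific implementation from the paper.

```python
import numpy as np

def deltas(feats, N=2):
    """Regression-based delta features over a (n_coeffs, n_frames) matrix:
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)."""
    n_frames = feats.shape[1]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Repeat the first/last frame at the edges so every frame has N neighbors.
    padded = np.pad(feats, ((0, 0), (N, N)), mode="edge")
    d = np.zeros_like(feats, dtype=float)
    for t in range(n_frames):
        d[:, t] = sum(
            n * (padded[:, t + N + n] - padded[:, t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return d

# Placeholder for a real 13-coefficient MFCC matrix over 100 frames.
mfcc = np.random.randn(13, 100)
d1 = deltas(mfcc)    # first-order (velocity) features
d2 = deltas(d1)      # second-order (acceleration) features
# Stack static + delta + delta-delta into a 39-dimensional vector per frame.
features = np.concatenate([mfcc, d1, d2], axis=0)
print(features.shape)  # (39, 100)
```

With real audio, the `mfcc` matrix would typically come from a library such as librosa; the concatenation step is unchanged.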
