Significance of analytic phase of speech signals in speaker verification

The importance of analytic phase in human perception of speaker identity is verified. Features are extracted from the derivative of the analytic phase and are referred to as IFCCs. IFCCs are found suitable for text-dependent and text-independent speaker recognition systems. Speaker verification performance of IFCCs is comparable to that of MFCC and FDLP features. Fusion of i-vectors from systems based on different features improves speaker verification.

The objective of this paper is to establish the importance of the phase of the analytic signal of speech, referred to as the analytic phase, in human perception of speaker identity as well as in automatic speaker verification. Subjective studies are conducted using speech signals with distorted analytic phase, and the resulting degradation in the human speaker verification task is observed. Motivated by the perceptual studies, we propose a method for feature extraction from the analytic phase of speech signals. As unambiguous computation of the analytic phase is not possible due to the phase wrapping problem, feature extraction is attempted from its derivative, i.e., the instantaneous frequency (IF). The IF is computed by exploiting the properties of the Fourier transform, and this strategy is free from the phase wrapping problem. The IF is computed from narrowband components of the speech signal, and the discrete cosine transform is applied to the deviations in IF to pack the information into a smaller number of coefficients, which are referred to as IF cosine coefficients (IFCCs). The nature of the information in the proposed IFCC features is studied using minimal-pair ABX (MP-ABX) tasks and t-stochastic neighbor embedding (t-SNE) visualizations. The performance of IFCC features is evaluated on the NIST 2010 SRE database and is compared with mel frequency cepstral coefficients (MFCCs) and frequency domain linear prediction (FDLP) features. All three features, IFCC, FDLP and MFCC, provided competitive speaker verification performance, with average EERs of 2.3%, 2.2% and 2.4%, respectively.
The IFCC features are more robust to vocal effort mismatch, and provided relative improvements of 26% and 11% over MFCC and FDLP features, respectively, on the evaluation conditions involving vocal effort mismatch. Since magnitude and phase represent different components of the speech signal, we have attempted to fuse the evidence from them at the i-vector level of the speaker verification system. It is found that i-vector fusion is considerably better than conventional score fusion. The i-vector fusion of FDLP+IFCC features provided a relative improvement of 36% over the system based on FDLP features alone, while the fusion of MFCC+IFCC provided a relative improvement of 37% over the system based on MFCC alone, illustrating that the proposed IFCC features provide speaker-specific information complementary to the magnitude-based FDLP and MFCC features.
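
The abstract states that the IF is obtained from Fourier transform properties rather than by unwrapping the phase. One way to read this is sketched below: for an analytic signal z[n], the phase derivative equals Im(z'[n] / z[n]), and z'[n] can be obtained by multiplying the spectrum of z by j·omega (the differentiation property), so the phase itself is never computed. This is only an illustrative sketch under that assumption, not the authors' implementation. The function names (`dft`, `analytic_signal`, `instantaneous_frequency`), the naive O(N^2) DFT used to avoid external dependencies, and the small `eps` guard are all hypothetical choices made here. The decomposition into narrowband components and the DCT step that yields the IFCCs are not shown.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT (assumption: kept dependency-free; use an FFT in practice)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Analytic signal: keep DC/Nyquist, double positive bins, zero negative bins."""
    n = len(x)
    spectrum = dft(x)
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue                  # Nyquist bin is left unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2.0        # positive frequencies
        else:
            spectrum[k] = 0j          # negative frequencies
    return dft(spectrum, inverse=True)


def instantaneous_frequency(x, eps=1e-12):
    """IF in radians/sample as Im(z'/z); no phase unwrapping is involved."""
    n = len(x)
    z = analytic_signal(x)
    spectrum = dft(z)
    # Differentiation property: z'[n] <-> j * omega_k * Z[k], with signed bins.
    deriv = []
    for k in range(n):
        signed_k = k if k <= n // 2 else k - n
        deriv.append(1j * 2.0 * math.pi * signed_k / n * spectrum[k])
    dz = dft(deriv, inverse=True)
    # Im(z'/z) written as Im(z' * conj(z)) / |z|^2; eps guards near-zero envelopes.
    return [(dz[m] * z[m].conjugate()).imag / (abs(z[m]) ** 2 + eps)
            for m in range(n)]
```

For a pure tone with a whole number of cycles in the analysis window, the estimate should be constant and equal to the tone's angular frequency. For speech, the same routine would be applied to each narrowband component separately.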
