Speaker characterization using spectral subband energy ratio based on Harmonic plus Noise Model

This paper proposes a feature extraction method for speaker characterization that explores the relationship between two distinct components of the speech signal: the harmonic part, which accounts for the periodicity of the signal, and the modulated noise part, which accounts for turbulence in the glottal airflow. The speech signal is decomposed into its harmonic and noise parts using the Harmonic plus Noise Model approach. We estimate spectral subband energy ratios (SSERs) as speaker-characteristic features, which are expected to reflect how the vocal tract and glottal airflow interact for an individual speaker. Speaker verification experiments on a GMM-UBM system show the effectiveness of the SSER features: combining them with conventional MFCC features reduces the equal error rate by 27.2%.
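As an illustration of the kind of feature described above, the following is a minimal sketch of a per-frame subband energy ratio computation. It is not the paper's exact definition, which the abstract does not give. It assumes that the harmonic and noise parts of a frame have already been separated by some Harmonic plus Noise Model analysis, that the subbands are uniformly spaced, and that the ratio is taken as harmonic energy over noise energy in dB. The function name `subband_energy_ratios` and the parameters `n_bands` and `eps` are hypothetical.

```python
import numpy as np


def subband_energy_ratios(harmonic, noise, n_bands=8, eps=1e-12):
    """Per-band harmonic-to-noise energy ratio (in dB) for one frame.

    Assumptions (not taken from the paper): uniform subbands, a Hann
    analysis window, and a ratio defined as harmonic over noise energy.
    """
    harmonic = np.asarray(harmonic, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if harmonic.shape != noise.shape or harmonic.ndim != 1:
        raise ValueError("harmonic and noise must be 1-D frames of equal length")

    # Power spectra of the two already-separated components.
    window = np.hanning(len(harmonic))
    h_pow = np.abs(np.fft.rfft(harmonic * window)) ** 2
    n_pow = np.abs(np.fft.rfft(noise * window)) ** 2

    # Uniform band edges over the FFT bins (an illustrative choice).
    edges = np.linspace(0, len(h_pow), n_bands + 1).astype(int)

    ratios = np.empty(n_bands)
    for b in range(n_bands):
        lo, hi = edges[b], edges[b + 1]
        h_energy = h_pow[lo:hi].sum() + eps  # eps avoids division by zero
        n_energy = n_pow[lo:hi].sum() + eps
        ratios[b] = 10.0 * np.log10(h_energy / n_energy)
    return ratios


# Toy usage: a low-frequency sinusoid stands in for the harmonic part,
# and weak white noise stands in for the noise part.
fs = 16000
t = np.arange(512) / fs
rng = np.random.default_rng(0)
toy_harmonic = np.sin(2 * np.pi * 200 * t)
toy_noise = 0.01 * rng.standard_normal(512)
toy_features = subband_energy_ratios(toy_harmonic, toy_noise)
```

In a verification system of the kind mentioned in the abstract, one such vector per voiced frame could be appended to that frame's MFCC vector before GMM-UBM modeling. How the paper actually combines the two feature sets is not stated here.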
