Articulation Rate Filtering of CQCC Features for Automatic Speaker Verification

This paper introduces a new articulation rate filter and reports its combination with the recently proposed constant Q cepstral coefficients (CQCCs) in their first application to automatic speaker verification (ASV). CQCC features are extracted with the constant Q transform (CQT), a perceptually inspired alternative to Fourier-based time-frequency analysis. The CQT offers greater frequency resolution at lower frequencies and greater time resolution at higher frequencies. When coupled with cepstral analysis and the new articulation rate filter, the resulting CQCC features are readily modelled using conventional techniques. A comparative assessment of CQCCs and mel frequency cepstral coefficients (MFCCs) in a short-duration speaker verification scenario shows that CQCCs generally outperform MFCCs and that the two feature representations are highly complementary; fusion experiments on the RSR2015 and RedDots databases show relative reductions in equal error rate of up to 60% compared to an MFCC baseline.
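The CQT-plus-cepstral-analysis pipeline described above can be sketched in a few lines. The sketch below is illustrative only: the kernel construction, window choice, and parameters (`fmin`, `bins_per_octave`, `n_bins`) are assumptions, and the published CQCC extractor additionally resamples the log constant-Q spectrum to a uniform frequency scale before the DCT, a step omitted here for brevity.

```python
import numpy as np

def naive_cqt_frame(frame, sr, fmin=32.7, bins_per_octave=12, n_bins=96):
    """Constant-Q magnitudes for one frame. Each bin k is centred at
    fk = fmin * 2**(k / bins_per_octave) and analysed with a window whose
    length is inversely proportional to fk, giving finer frequency
    resolution at low frequencies and finer time resolution at high
    frequencies, as the CQT prescribes."""
    # Q is the (constant) ratio of centre frequency to bandwidth.
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    mags = np.zeros(n_bins)
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / bins_per_octave)
        Nk = min(int(np.ceil(Q * sr / fk)), len(frame))  # window length for bin k
        n = np.arange(Nk)
        # Windowed complex exponential kernel, normalised by its length.
        kernel = np.hanning(Nk) * np.exp(-2j * np.pi * Q * n / Nk) / Nk
        mags[k] = np.abs(np.dot(frame[:Nk], kernel))
    return mags

def cqcc_like(frame, sr, n_coeffs=13, **cqt_kwargs):
    """Cepstral analysis of the constant-Q spectrum: log magnitudes
    followed by a DCT-II, analogous to the MFCC recipe but on a
    geometrically spaced frequency axis."""
    log_spec = np.log(naive_cqt_frame(frame, sr, **cqt_kwargs) + 1e-10)
    K = len(log_spec)
    # Explicit DCT-II basis (numpy has no built-in DCT).
    basis = np.cos(np.pi / K * (np.arange(K) + 0.5)[None, :]
                   * np.arange(n_coeffs)[:, None])
    return basis @ log_spec
```

For example, `cqcc_like(signal_frame, 16000)` returns 13 cepstral coefficients for a 16 kHz frame; the articulation rate filter proposed in the paper would then operate on the time trajectories of such coefficients.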
