Feature Extraction Using Power-Law Adjusted Linear Prediction With Application to Speaker Recognition Under Severe Vocal Effort Mismatch

Linear prediction is one of the most established techniques in signal estimation, and it is widely used in speech signal processing. It has long been understood that the nerve firing rate of the human auditory system can be approximated by a power-law nonlinearity, and this has motivated the use of perceptual linear prediction for extracting acoustic features in a variety of speech processing applications. In this paper, we revisit the application of the power-law nonlinearity to speech spectrum estimation by compressing or expanding the power spectrum in autocorrelation-based linear prediction. The development of the resulting method, called LP-α, is motivated by a desire to obtain spectral features that exhibit less mismatch than conventionally used spectrum estimation methods when speech of normal loudness is compared with speech under vocal effort. The effectiveness of the proposed approach is demonstrated in a speaker recognition task conducted under severe vocal effort mismatch, in which shouted speech is compared with normal speech.
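The idea of compressing or expanding the power spectrum inside autocorrelation-based linear prediction can be illustrated with a short sketch. It is not the authors' implementation: it assumes that the power spectrum of a windowed frame is raised to an exponent alpha before being transformed back to an autocorrelation sequence, from which the predictor is solved with the Levinson-Durbin recursion. The function names `lp_alpha` and `levinson_durbin`, the Hamming window, the model order, and the FFT length are all illustrative assumptions. With alpha = 1 the sketch reduces to conventional autocorrelation-based linear prediction.

```python
import numpy as np


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns the predictor polynomial a (with a[0] = 1) and the
    final prediction error energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current residual and lag i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lp_alpha(frame, order=20, alpha=1.0, nfft=None):
    """Hypothetical sketch of power-law adjusted linear prediction.

    Assumption: the power spectrum is raised to the exponent alpha
    (alpha < 1 compresses, alpha > 1 expands) before the inverse FFT
    that yields the autocorrelation sequence.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    if nfft is None:
        # Zero-padding to at least 2n - 1 points makes the inverse FFT
        # of the power spectrum equal the linear autocorrelation
        # when alpha = 1.
        nfft = 2 * n
    windowed = frame * np.hamming(n)
    power = np.abs(np.fft.rfft(windowed, nfft)) ** 2
    power_adjusted = power ** alpha
    r = np.fft.irfft(power_adjusted, nfft)[:order + 1]
    return levinson_durbin(r, order)
```

Because the adjusted power spectrum stays non-negative, the resulting sequence is still a valid autocorrelation, so the all-pole model obtained in this sketch should remain stable for any positive alpha. The returned coefficients could then be converted to a spectral envelope or cepstral features in the usual way.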
