Feature Extraction Methods for Speaker Recognition: A Review

This paper presents the main research paradigms in feature extraction methods that further advance the state of the art in speaker recognition (SR), which has been recognized extensively in person identif...
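To make the idea of a short-term spectral feature concrete, here is a minimal sketch of mel-frequency cepstral coefficients (MFCCs), one of the classic representations in the cited literature (e.g. [37], [84], [116]), followed by per-utterance cepstral mean normalization in the spirit of [93]. It is one common MFCC variant, not necessarily the configuration used in the review or in any cited paper. Everything below is an assumption chosen for illustration: the function names `mfcc`, `mel_filterbank` and `cepstral_mean_normalize`, the 25 ms frames with a 10 ms hop, the 512-point FFT, 26 triangular mel filters, 13 coefficients, a 0.97 pre-emphasis factor, the 2595·log10(1 + f/700) mel formula and an unnormalized DCT-II. Only NumPy is required.

```python
import numpy as np

def hz_to_mel(f):
    # A commonly used mel-scale formula (assumed here, other variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale (one row per filter)."""
    n_bins = n_fft // 2 + 1
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        left, center, right = bin_points[i - 1], bin_points[i], bin_points[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge, peak of 1 at the center bin
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_ceps=13, pre_emph=0.97):
    """Return an (n_frames, n_ceps) MFCC matrix for a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis boosts high frequencies: y[n] = x[n] - a * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # 2. Overlapping Hamming-windowed frames (assumes at least one full frame)
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(emphasized) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)
    # 3. Short-time power spectrum of each frame (zero-padded to n_fft points)
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft
    # 4. Log mel filterbank energies; the floor avoids log(0)
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 5. DCT-II decorrelates the log energies; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return log_energies @ dct.T

def cepstral_mean_normalize(ceps):
    # Subtract each coefficient's per-utterance mean, which removes a fixed
    # convolutive (channel) offset that appears as an additive cepstral bias
    return ceps - ceps.mean(axis=0, keepdims=True)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr  # one second of audio
    x = np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.default_rng(0).standard_normal(sr)
    feats = cepstral_mean_normalize(mfcc(x, sr))
    print(feats.shape)  # (98, 13)
```

With these defaults, one second of 16 kHz audio yields 98 frames of 13 coefficients. Real front ends typically add delta (transitional) coefficients, as in [1] and [93], and may swap the mel filterbank for other auditory-motivated designs such as PLP [10] or gammatone-based features [13].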

[1] Frank K. Soong, et al. On the use of instantaneous and transitional spectral information in speaker recognition, 1988, IEEE Trans. Acoust. Speech Signal Process.

[2] Fred Cummins, et al. Speaker Identification Using Instantaneous Frequencies, 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[3] Douglas A. Reynolds, et al. Fusing high- and low-level features for speaker recognition, 2003, INTERSPEECH.

[4] B. Atal, et al. Speech analysis and synthesis by linear prediction of the speech wave, 1971, The Journal of the Acoustical Society of America.

[5] Richard M. Stern, et al. Features Based on Auditory Physiology and Perception, 2012, Techniques for Noise Robustness in Automatic Speech Recognition.

[6] Hynek Hermansky, et al. RASTA processing of speech, 1994, IEEE Trans. Speech Audio Process.

[7] Biing-Hwang Juang, et al. Fundamentals of speech recognition, 1993, Prentice Hall signal processing series.

[8] Olli Viikki, et al. A recursive feature vector normalization approach for robust speech recognition in noise, 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[9] E. Lenneberg. Biological Foundations of Language, 1967.

[10] H. Hermansky, et al. Perceptual linear predictive (PLP) analysis of speech, 1990, The Journal of the Acoustical Society of America.

[11] DeLiang Wang, et al. Robust speaker identification using auditory features and computational auditory scene analysis, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12] Paavo Alku, et al. On separating glottal source and vocal tract information in telephony speaker verification, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Qi Li, et al. An Auditory-Based Feature Extraction Algorithm for Robust Speaker Identification Under Mismatched Conditions, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[14] Qi Li, et al. An auditory-based transform for audio signal processing, 2009, 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[15] Fan-Gang Zeng, et al. Encoding frequency modulation to improve cochlear implant performance in noise, 2005, IEEE Transactions on Biomedical Engineering.

[16] A. E. Rosenberg, et al. Evaluation of an automatic speaker-verification system over telephone lines, 1976, The Bell System Technical Journal.

[17] John H. L. Hansen, et al. A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition, 2008, Speech Commun.

[18] Mark J. F. Gales, et al. Noisy Constrained Maximum-Likelihood Linear Regression for Noise-Robust Speech Recognition, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[19] Douglas A. Reynolds, et al. Experimental evaluation of features for robust speaker identification, 1994, IEEE Trans. Speech Audio Process.

[20] Petros Maragos, et al. Time-frequency distributions for automatic speech recognition, 2001, IEEE Trans. Speech Audio Process.

[21] John H. L. Hansen, et al. Nonlinear analysis and classification of speech under stressed conditions, 1994.

[22] Alejandro Acero, et al. Acoustical and environmental robustness in automatic speech recognition, 1991.

[23] K. Ohkura, et al. Speech recognition in a noisy environment using a noise reduction neural network and a codebook mapping technique, 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[24] Tomi Kinnunen, et al. Spectral Features for Automatic Text-Independent Speaker Recognition, 2003.

[25] Malayappan Shridhar, et al. Text-independent speaker recognition using orthogonal linear prediction, 1981, ICASSP.

[26] Douglas A. Reynolds, et al. Speaker Verification Using Adapted Gaussian Mixture Models, 2000, Digit. Signal Process.

[27] José L. Pérez-Córdoba, et al. Histogram equalization of speech representation for robust speech recognition, 2005, IEEE Transactions on Speech and Audio Processing.

[28] DeLiang Wang, et al. CASA-Based Robust Speaker Identification, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[29] A. Hussain, et al. Nonlinear speech processing: Overview and applications, 2002.

[30] Petros Maragos, et al. Energy separation in signal modulations with application to speech analysis, 1993, IEEE Trans. Signal Process.

[31] Jen-Tzung Chien, et al. Aggregate a posteriori linear regression adaptation, 2006, IEEE Trans. Speech Audio Process.

[32] Kuldip K. Paliwal, et al. Usefulness of phase spectrum in human speech perception, 2003, INTERSPEECH.

[33] Jiqing Han, et al. Sparse-Based Auditory Model for Robust Speaker Recognition, 2012, Int. J. Pattern Recognit. Artif. Intell.

[34] Geoffrey Zweig, et al. Lattice-Based Unsupervised MLLR for Speaker Adaptation, 2000.

[35] Hema A. Murthy. Algorithms for Processing Fourier Transform Phase of Signals, 1991.

[36] Themos Stafylakis, et al. Combining amplitude and phase-based features for speaker verification with short duration utterances, 2015, INTERSPEECH.

[37] Stan Davis, et al. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se, 1980.

[38] Wei Wang, et al. Speaker Verification via Modeling Kurtosis Using Sparse Coding, 2016, Int. J. Pattern Recognit. Artif. Intell.

[39] Elizabeth Shriberg, et al. Higher-Level Features in Speaker Recognition, 2007, Speaker Classification.

[40] Rajesh M. Hegde, et al. Significance of the Modified Group Delay Feature in Speech Recognition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[41] Kuldip K. Paliwal, et al. Frequency-related representation of speech, 2003, INTERSPEECH.

[42] S. Boll, et al. Suppression of acoustic noise in speech using spectral subtraction, 1979.

[43] Hong-Goo Kang, et al. Speaker recognition based on transformed line spectral frequencies, 2004, Proceedings of 2004 International Symposium on Intelligent Signal Processing and Communication Systems, 2004. ISPACS 2004.

[44] Haizhou Li, et al. Feature Normalization Using Structured Full Transforms for Robust Speech Recognition, 2011, INTERSPEECH.

[45] Eliathamby Ambikairajah, et al. FM features for automatic forensic speaker recognition, 2008, INTERSPEECH.

[46] John G. Harris, et al. Human factor cepstral coefficients, 2002.

[47] Pedro J. Moreno, et al. Speech recognition in noisy environments, 1996.

[48] Rhee Man Kil, et al. Auditory processing of speech signals for robust speech recognition in real-world noisy environments, 1999, IEEE Trans. Speech Audio Process.

[49] Philip C. Woodland, et al. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, 1995, Comput. Speech Lang.

[50] Abeer Alwan, et al. A model of dynamic auditory perception and its application to robust word recognition, 1997, IEEE Trans. Speech Audio Process.

[51] Amita Pal, et al. Speaker Identification by Aggregating Gaussian Mixture Models (GMMs) Based on Uncorrelated MFCC-derived Features, 2014, Int. J. Pattern Recognit. Artif. Intell.

[52] Douglas E. Sturim, et al. The MIT Lincoln Laboratory 2008 speaker recognition system, 2009, INTERSPEECH.

[53] Mark J. F. Gales, et al. Cepstral parameter compensation for HMM recognition in noise, 1993, Speech Commun.

[54] Kuldip K. Paliwal, et al. Speech Coding and Synthesis, 1995.

[55] Chin-Hui Lee, et al. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[56] Hermann Ney, et al. Using phase spectrum information for improved speech recognition performance, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[57] J. Allen, et al. Nonlinear phenomena as observed in the ear canal and at the auditory nerve, 1985, The Journal of the Acoustical Society of America.

[58] Kuldip K. Paliwal, et al. Short-time phase spectrum in speech processing: A review and some experimental results, 2007, Digit. Signal Process.

[59] Gang Li, et al. Signal representation based on instantaneous amplitude models with application to speech synthesis, 2000, IEEE Trans. Speech Audio Process.

[60] A. Hudspeth, et al. Essential nonlinearities in hearing, 2000, Physical Review Letters.

[61] Judith Markowitz. The Many Roles of Speaker Classification in Speaker Verification and Identification, 2007, Speaker Classification.

[62] Denis Jouvet, et al. Evaluation of a noise-robust DSR front-end on Aurora databases, 2002, INTERSPEECH.

[63] Douglas E. Sturim, et al. Classification Methods for Speaker Recognition, 2007, Speaker Classification.

[64] James R. Glass, et al. Robust Speaker Recognition in Noisy Conditions, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[65] Bayya Yegnanarayana, et al. Combining evidence from residual phase and MFCC features for speaker recognition, 2006, IEEE Signal Processing Letters.

[66] A. Marchal, et al. Speech production and speech modelling, 1990.

[67] Qiang Huo, et al. A Study of Minimum Classification Error (MCE) Linear Regression for Supervised Adaptation of MCE-Trained Continuous-Density Hidden Markov Models, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[68] B. Yegnanarayana, et al. Significance of group delay functions in signal reconstruction from spectral magnitude or phase, 1984.

[69] B. Moore, et al. Suggested formulae for calculating auditory-filter bandwidths and excitation patterns, 1983, The Journal of the Acoustical Society of America.

[70] Chin-Hui Lee, et al. Structural maximum a posteriori linear regression for fast HMM adaptation, 2002, Comput. Speech Lang.

[71] Yifan Gong, et al. High-performance HMM adaptation with joint compensation of additive and convolutive distortions via Vector Taylor Series, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[72] Laurent Mauuary, et al. Blind equalization for robust telephone based speech recognition, 1996, 1996 8th European Signal Processing Conference (EUSIPCO 1996).

[73] H. M. Teager, et al. Evidence for Nonlinear Sound Production Mechanisms in the Vocal Tract, 1990.

[74] Tim Jürgens, et al. Noise Robust Distant Automatic Speech Recognition Utilizing NMF Based Source Separation and Auditory Feature Extraction, 2013.

[75] Shantanu Chakrabartty, et al. Sparse Auditory Reproducing Kernel (SPARK) Features for Noise-Robust Speech Recognition, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[76] David R. Cole, et al. Speaker recognition in reverberant enclosures, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[77] Robert B. Dunn, et al. Speech enhancement based on auditory spectral change, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[78] Mark J. F. Gales, et al. Unsupervised Adaptation With Discriminative Mapping Transforms, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[79] Chin-Hui Lee, et al. A structural Bayes approach to speaker adaptation, 2001, IEEE Trans. Speech Audio Process.

[80] Andreas Stolcke, et al. The ICSI-SRI Spring 2006 Meeting Recognition System, 2006, MLMI.

[81] Richard M. Stern, et al. Feature extraction for robust speech recognition based on maximizing the sharpness of the power distribution and on power flooring, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[82] Rong Tong, et al. The I4U system in NIST 2008 speaker recognition evaluation, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[83] Abeer Alwan, et al. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR, 2005, IEEE Transactions on Speech and Audio Processing.

[84] Nikos Fakotakis, et al. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task, 2007.

[85] Lin-Shan Lee, et al. Higher Order Cepstral Moment Normalization for Improved Robust Speech Recognition, 2009, IEEE Trans. Speech Audio Process.

[86] S. Molau, et al. Feature space normalization in adverse acoustic conditions, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

[87] A. Gray, et al. Long-term feature averaging for speaker recognition, 1977.

[88] Wai Nang Chan, et al. Discrimination Power of Vocal Source and Vocal Tract Related Features for Speaker Segmentation, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[89] Alfred Mertins, et al. Contextual invariant-integration features for improved speaker-independent speech recognition, 2011, Speech Commun.

[90] Dimitrios Dimitriadis, et al. Spectral Moment Features Augmented by Low Order Cepstral Coefficients for Robust ASR, 2010, IEEE Signal Processing Letters.

[91] Douglas A. Reynolds, et al. Modeling of the glottal flow derivative waveform with application to speaker identification, 1999, IEEE Trans. Speech Audio Process.

[92] DeLiang Wang, et al. Robust Speaker Identification in Noisy and Reverberant Conditions, 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[93] S. Furui, et al. Cepstral analysis technique for automatic speaker verification, 1981.

[94] Günther Palm, et al. On the use of residual cepstrum in speech recognition, 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[95] DeLiang Wang, et al. A computational auditory scene analysis system for speech segregation and robust speech recognition, 2010, Comput. Speech Lang.

[96] K. Takaya, et al. Recognition of syllables in a continuous stream of speech by PARCOR parameters of linear predictive vocoder, 2005, Canadian Conference on Electrical and Computer Engineering, 2005.

[97] K. I. Ramachandran, et al. Towards improving the performance of text/language independent speaker recognition systems, 2014, 2014 International Conference on Power Signals Control and Computations (EPSCICON).

[98] B. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, 1974, The Journal of the Acoustical Society of America.

[99] Fan-Gang Zeng, et al. Speech recognition with amplitude and frequency modulations, 2005, Proceedings of the National Academy of Sciences of the United States of America.

[100] Jan Van der Spiegel, et al. Robust auditory-based speech processing using the average localized synchrony detection, 2002, IEEE Trans. Speech Audio Process.

[101] Mark J. F. Gales, et al. Maximum likelihood linear transformations for HMM-based speech recognition, 1998, Comput. Speech Lang.

[102] Abeer Alwan, et al. Evaluation of noise robust features on the Aurora databases, 2002, INTERSPEECH.

[103] Jeff A. Bilmes, et al. MVA Processing of Speech Features, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[104] Yan Ming Cheng, et al. SNR-dependent waveform processing for improving the robustness of ASR front-end, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[105] Chin-Hui Lee, et al. Joint maximum a posteriori adaptation of transformation and HMM parameters, 2001, IEEE Trans. Speech Audio Process.

[106] Kuldip K. Paliwal, et al. Product of power spectrum and group delay function for speech recognition, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[107] Tomi Kinnunen, et al. Joint Acoustic-Modulation Frequency for Speaker Recognition, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[108] N. Kitaoka. Speech recognition under noisy environments using spectral subtraction with smoothing of time direction and real-time cepstral mean normalization, 2001.

[109] Mark D. Skowronski, et al. Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition, 2004, The Journal of the Acoustical Society of America.

[110] G. Miet, et al. Speech enhancement via frequency bandwidth extension using line spectral frequencies, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[111] Dong Jin Seo, et al. An approach to robust unsupervised speaker adaptation, 2005, IEEE Signal Process. Lett.

[112] William M. Campbell, et al. Speaker Verification Using Support Vector Machines and High-Level Features, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[113] Jan Cernocký, et al. Probabilistic and Bottle-Neck Features for LVCSR of Meetings, 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[114] Frank K. Soong, et al. An auditory system-based feature for robust speech recognition, 2001, INTERSPEECH.

[115] Hynek Hermansky, et al. Temporal patterns (TRAPs) in ASR of noisy speech, 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[116] Douglas D. O'Shaughnessy, et al. Multitaper MFCC and PLP features for speaker verification using i-vectors, 2013, Speech Commun.

[117] Douglas A. Reynolds, et al. The SuperSID project: exploiting high-level information for high-accuracy speaker recognition, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

[118] Abeer Alwan, et al. On the use of variable frame rate analysis in speech recognition, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[119] E. Ambikairajah, et al. Extraction of FM components from speech signals using all-pole model, 2008.

[120] Seung Ho Choi, et al. Cepstrum third-order normalisation method for noisy speech recognition, 1999.

[121] Daniel P. W. Ellis, et al. Feature extraction using non-linear transformation for robust speech recognition on the Aurora database, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[122] Qiang Fu, et al. Robust Glottal Source Estimation Based on Joint Source-Filter Model Optimization, 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[123] E. E. David, et al. Artificial Auditory Recognition in Telephony, 1958, IBM J. Res. Dev.

[124] Haizhou Li, et al. An overview of text-independent speaker recognition: From features to supervectors, 2010, Speech Commun.

[125] Bayya Yegnanarayana, et al. Significance of group delay functions in spectrum estimation, 1992, IEEE Trans. Signal Process.

[126] Hema A. Murthy, et al. The modified group delay function and its application to phoneme recognition, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

[127] P. Maragos, et al. Speech formant frequency and bandwidth tracking using multiband energy demodulation, 1996.

[128] Birger Kollmeier, et al. Amplitude modulation spectrogram based features for robust speech recognition in noisy and reverberant environments, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[129] Petros Maragos, et al. Robust AM-FM features for speech recognition, 2005, IEEE Signal Processing Letters.

[130] Yifan Gong, et al. HMM adaptation using a phase-sensitive acoustic distortion model for environment-robust speech recognition, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.