Voice source cepstrum processing for speaker identification

Voice source analysis and modelling have played a key role in important speech applications such as speech recognition, speech synthesis and speaker recognition. This work presents a robust algorithm for glottal closure detection and a novel set of voice source features for speaker recognition. In the first part of the dissertation, the DYPSA algorithm is developed for detecting glottal closure instants (GCIs). This part includes a detailed study of group delay functions and their application to the linear prediction residual; glottal closure candidate generation from the group delay function; cost function design with regard to the properties of the speech signal at the point of closure; and a dynamic programming algorithm used to reject unlikely glottal closure candidates. The DYPSA algorithm is evaluated on a speech database that includes simultaneous laryngograph recordings to provide reference glottal closure instants. The algorithm achieves a 95.7% identification rate with a timing error standard deviation of 0.71 ms. In the second part of the dissertation, GCI detection allows the vocal tract transfer function to be estimated using closed-phase analysis. This transfer function is converted to cepstrum coefficients (VTCC), which are subtracted from the mel-frequency cepstrum coefficients (MFCC) to derive a set of voice source cepstrum coefficients (VSCC). The VSCC are then used for speaker identification on the TIMIT database. We show that although a classifier using MFCC performs better than one using VSCC, the combination of the two gives a significant improvement in recognition rate, illustrat-
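The cepstral subtraction behind the VSCC can be illustrated with a toy example. Under the source-filter model, the speech spectrum is the product of a source spectrum and a vocal tract spectrum, so their cepstra add, and subtracting the vocal tract cepstrum from the speech cepstrum leaves the source contribution. The sketch below is not the dissertation's implementation. It assumes linear-frequency cepstra of all-pole filters rather than mel-frequency cepstra, uses a hypothetical one-pole "source" filter, and all function names (`lpc_to_cepstrum`, `subtract_cepstra`, `poly_mul`) are illustrative.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum of the all-pole filter 1/A(z), A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p.

    `a` includes the leading 1. Returns [c_1, ..., c_n_ceps]; the gain term c_0 is omitted.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)  # c[0] is unused
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]


def subtract_cepstra(speech_cc, vocal_tract_cc):
    """Element-wise difference: what remains is attributed to the voice source."""
    return [s - v for s, v in zip(speech_cc, vocal_tract_cc)]


def poly_mul(x, y):
    """Multiply two polynomials in z^-1 (cascade of two filters)."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


# Toy filters (assumed values, chosen only to be stable).
a_source = [1.0, -0.9]        # hypothetical source model: one real pole at 0.9
a_tract = [1.0, -1.2, 0.72]   # hypothetical vocal tract: one complex pole pair
a_speech = poly_mul(a_source, a_tract)  # cascade of source and tract

n_ceps = 12
speech_cc = lpc_to_cepstrum(a_speech, n_ceps)
tract_cc = lpc_to_cepstrum(a_tract, n_ceps)
residual_cc = subtract_cepstra(speech_cc, tract_cc)
source_cc = lpc_to_cepstrum(a_source, n_ceps)
```

In this toy setting `residual_cc` matches `source_cc` up to floating-point error, because the cepstrum of a cascade of minimum-phase filters is the sum of the individual cepstra. With real speech, the two cepstra would need to be computed in the same domain (for example, both mel-warped) before subtracting.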
