Articulatory-feature based pronunciation modelling for high-level speaker verification

Speaker verification is a binary classification problem whose objective is to determine whether a test utterance was produced by a client speaker. Text-independent speaker verification systems typically extract speaker-dependent features from short-term spectra of speech signals to build speaker-dependent Gaussian mixture models (GMMs). While this short-term spectral approach can achieve reasonably good performance in controlled environments, its lack of robustness in real-world environments remains a serious problem. To improve the robustness of spectral-based systems, long-term high-level features have been investigated in recent years. Among the high-level features investigated, the use of articulatory features (AFs) for constructing conditional pronunciation models (CPMs) has been very promising. The resulting models are referred to as articulatory-feature based conditional pronunciation models, or simply AFCPMs. The drawback of AFCPMs, however, is that the pronunciation models are phoneme-dependent, meaning that they require one discrete density function for each phoneme. This dissertation demonstrates that this phoneme dependency leads to speaker models with low discriminative power, especially when the amount of training data is limited. To overcome this problem, this dissertation proposes four new techniques for articulatory-feature based pronunciation modeling.

1. Phonetic-Class Dependent AFCPM (CD-AFCPM). In this modeling technique, the density functions are conditioned on phonetic classes instead of phonemes. The phonetic classes are created from phonemes through three different mapping functions, which are obtained by (1) vector quantizing the discrete densities in the phoneme-dependent universal background models, (2) using the phone properties specified in the classical phoneme tree, and (3) combining (1) and (2).

2. Probabilistic Weighting Scheme. In the original CD-AFCPM, all frames are considered to be equally important during density estimation. However, frames that have a higher probability of belonging to the phonetic class being modeled should be given a greater weight. This dissertation therefore proposes a weighting scheme for computing the pronunciation models such that frames with a higher probability of belonging to a particular class make a larger contribution to the model of that class. A new scoring method that uses an SVM to combine the scores generated from the phonetic-class models is also proposed.

3. Model Adaptation. Speaker verification based on high-level speaker features requires long enrolment utterances to be reliable. However, in practical speaker verification, it is common to model speakers based on a limited amount of enrolment data. To alleviate this problem, this dissertation proposes a new adaptation method for creating speaker models. The method adapts not only the phoneme-dependent background model but also the phoneme-independent speaker model.

4. Articulatory-Feature Kernels. The log-likelihood ratio scoring method in the original AFCPM does not explicitly use the discriminative information available in the training data because the target speaker models and background models are trained separately. This dissertation proposes converting the speaker models to supervectors in a high-dimensional space by stacking the discrete densities in the AFCPMs. An AF-kernel is constructed from the supervectors of target speakers, background speakers, and claimants. Then, an SVM is discriminatively trained to classify the supervectors.

These four techniques have been evaluated on the NIST 2000 dataset. The evaluation leads to five findings:

1. Among the three mapping functions, the one that combines the classical phoneme tree and the Euclidean distance between AFCPMs achieves the best performance;
2. Phonetic-class AFCPM achieves a significantly lower error rate than conventional AFCPM;
3. The weighting scheme leads to better speaker models and hence helps to improve verification performance;
4. The proposed adaptation method, which uses as much information as possible from the training data, significantly outperforms the classical MAP adaptation method; and
5. The proposed AF-kernel is complementary to the likelihood-ratio scoring method, and their fusion can improve verification performance.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to various bodies at The Hong Kong Polytechnic University, where I have had the opportunity to study. My major debt is to my supervisor, Dr. M. W. Mak, whose expertise, understanding, and patience added considerably to my graduate experience. I appreciate his vast knowledge and skill in many areas (e.g., speech, bioinformatics, machine learning, software engineering, interaction with participants), and his assistance in writing papers and this dissertation. I have learned a great deal from him. Without his help, this study could not have been completed.

I would also like to thank Prof. Helen M. Meng, who is our coauthor, for her constructive comments and suggestions for improving our papers. I would further like to express my appreciation to all the professors who taught me during my master's study. The countless discussions with my teachers, and their enthusiasm in clearing up my misunderstandings, have proved fruitful and inspiring. I would also like to thank all members of staff of the Department of Electronic and Information Engineering and the clerical staff in the General Office. They have provided a creative environment for me to study in. It is also my pleasure to acknowledge the Research and Postgraduate Studies Office of The Hong Kong Polytechnic University for its generous support over the past two years.

Last but not least, I am indebted to my parents for their endless support and encouragement. Without them, this study could not have been completed.
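
As an illustration of the probabilistic weighting scheme (technique 2) and the supervector stacking (technique 4) summarized in the abstract above, the following is a minimal sketch only. The abstract gives no formulas, so everything here is an assumption: AF labels are taken to be integer indices per frame, the frame weights are taken to be class posterior probabilities, and the function names `weighted_class_densities` and `stack_supervector` are hypothetical. It is not the dissertation's actual implementation.

```python
import numpy as np


def weighted_class_densities(af_labels, class_posteriors, num_af_values, eps=1e-8):
    """Estimate one discrete density over AF labels per phonetic class.

    af_labels:        (T,) integer AF label index of each frame (assumed encoding).
    class_posteriors: (T, C) probability that frame t belongs to class c (assumed weights).
    Returns a (C, num_af_values) array whose rows each sum to 1.
    """
    af_labels = np.asarray(af_labels)
    class_posteriors = np.asarray(class_posteriors, dtype=float)
    num_classes = class_posteriors.shape[1]

    counts = np.zeros((num_classes, num_af_values))
    for c in range(num_classes):
        # Each frame adds its posterior for class c (a soft count) to the bin
        # of its AF label, so likelier frames contribute more to the model.
        counts[c] = np.bincount(
            af_labels, weights=class_posteriors[:, c], minlength=num_af_values
        )

    counts += eps  # small floor so classes with no frames still normalize
    return counts / counts.sum(axis=1, keepdims=True)


def stack_supervector(densities):
    """Stack the per-class discrete densities into a single vector."""
    return np.asarray(densities).reshape(-1)


# Toy example: 4 frames, 3 possible AF labels, 2 phonetic classes.
af_labels = [0, 1, 1, 2]
class_posteriors = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
    [0.0, 1.0],
]
densities = weighted_class_densities(af_labels, class_posteriors, num_af_values=3)
supervector = stack_supervector(densities)
```

With one-hot posteriors the estimate reduces to plain relative frequencies, which corresponds to treating all frames of a class as equally important. The resulting supervectors could then be passed to any SVM implementation; the kernel itself is not sketched here because the abstract does not specify it.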
