Speaker recognition and transformation (Reconnaissance et transformation de locuteurs)

This PhD thesis investigates how to analyse, decompose, model and transform the vocal identity of a human speaker as seen through an automatic speaker recognition application. It begins with an introduction to the properties of the speech signal and the basis of automatic speaker recognition. The errors of an operational speaker recognition application are then analysed. The deficiencies and mistakes observed in the running application lead to observations that call for a re-evaluation of the parameters characterising a speaker and for a reconsideration of some parts of the automatic speaker recognition chain. To determine which parameters characterise a speaker, they are extracted from the speech signal with a harmonic plus noise (H+N) analysis and synthesis model. The analysis and re-synthesis of the harmonic and noise parts indicate which of them are speech dependent and which are speaker dependent. It is then shown that speaker-discriminating information can be found in the residual obtained by subtracting the H+N modeled signal from the original signal. Next, a study of the impostor phenomenon, essential for tuning a speaker recognition system, is carried out. Impostors are simulated in two ways. First, the speech of a source speaker (the impostor) is transformed into the speech of a target speaker (the client) using the parameters extracted from the H+N model. This parameter transformation is effective as an attack: the false acceptance rate grows from 4% to 23%. Second, an automatic imposture by speech segment concatenation is carried out; in this case the false acceptance rate grows to 30%. One way to become less sensitive to impostures based on spectral modification is to remove the harmonic part, or even the noise part, modeled by the H+N from the original signal. Such a subtraction reduces the false acceptance rate to 8%, even when transformed impostors are used.
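The false acceptance rates quoted above are the fraction of impostor trials that the system wrongly accepts. As a minimal sketch of that measurement, assuming each trial yields a single score that is accepted when it reaches a fixed threshold (the function name, the scores and the threshold below are illustrative assumptions, not data or code from the thesis):

```python
def false_acceptance_rate(impostor_scores, threshold):
    """Return the fraction of impostor trials whose score reaches the threshold.

    Assumption: a higher score means the trial looks more like the client,
    and a trial is accepted when score >= threshold.
    """
    if not impostor_scores:
        raise ValueError("need at least one impostor trial")
    accepted = sum(1 for score in impostor_scores if score >= threshold)
    return accepted / len(impostor_scores)


# Hypothetical scores, for illustration only (not results from the thesis).
baseline_scores = [-2.1, -1.4, -0.9, -0.3, 0.4]     # untransformed impostors
transformed_scores = [-1.0, -0.2, 0.1, 0.6, 0.9]    # impostors after voice transformation
threshold = 0.0

far_baseline = false_acceptance_rate(baseline_scores, threshold)
far_transformed = false_acceptance_rate(transformed_scores, threshold)
```

With the threshold held fixed, a transformation that pushes impostor scores toward the client's range raises the measured rate, which is the effect the abstract reports.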
To overcome the lack of training data, one of the main causes of modeling errors in speaker recognition, a decomposition of the recognition task into a set of binary classifiers is proposed. A classifier matrix is built, and each of its elements has to classify, word by word, the data coming from the client and from another speaker (named here an anti-speaker, randomly chosen from an external database). With such an approach it is possible to weight the results according to the vocabulary or to the neighbours of the client in the parameter (acoustic) space. The outputs of the matrix classifiers are then weighted and combined to produce a single output score. The weights are estimated on validation data, and if the weighting is done properly, the binary pair speaker recognition system gives better results than a state-of-the-art HMM-based system. In order to set an operating point (i.e. a point on the ROC curve) for the speaker recognition application, an a priori threshold has to be determined. Theoretically, the threshold should be speaker independent when stochastic models are used. However, practical experiments show that this is not the case: because of modeling mismatch, the threshold becomes dependent on the speaker and on the utterance length. A theoretical framework showing how to adjust the threshold using the local likelihood ratio is then developed. Finally, a last method for correcting modeling errors, based on decision fusion, is proposed. Practical experiments show the advantages and drawbacks of the fusion approach in speaker recognition applications.
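The abstract does not specify how the classifier matrix outputs are combined, so the following is only a sketch of one plausible reading: a normalised weighted average followed by a comparison with the a priori threshold. The matrix layout (rows for words, columns for anti-speakers), the function names and all numbers are assumptions made for illustration.

```python
def fuse_scores(score_matrix, weights):
    """Combine per-classifier scores into one score by a weighted average.

    Assumption: score_matrix[w][a] is the output of the binary classifier
    for word w against anti-speaker a, and weights has the same shape.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for score_row, weight_row in zip(score_matrix, weights):
        if len(score_row) != len(weight_row):
            raise ValueError("scores and weights must have the same shape")
        for score, weight in zip(score_row, weight_row):
            weighted_sum += weight * score
            total_weight += weight
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return weighted_sum / total_weight


def accept(score_matrix, weights, threshold):
    """Accept the identity claim when the fused score reaches the threshold."""
    return fuse_scores(score_matrix, weights) >= threshold


# Hypothetical classifier outputs: 2 words x 2 anti-speakers.
scores = [[0.8, 0.6],
          [0.4, 0.2]]
uniform_weights = [[1.0, 1.0],
                   [1.0, 1.0]]
# Hypothetical weights favouring the first word, as if estimated on validation data.
word_weights = [[3.0, 3.0],
                [1.0, 1.0]]

uniform_score = fuse_scores(scores, uniform_weights)
weighted_score = fuse_scores(scores, word_weights)
decision = accept(scores, word_weights, threshold=0.55)
```

Normalising by the total weight keeps the fused score on the same scale as the individual classifier outputs, so a single threshold can be compared across different weightings.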
