The use of long-term features for GMM- and i-vector-based speaker diarization systems

Several factors contribute to the performance of speaker diarization systems. The selection of appropriate speech features is one of the key factors; others include the techniques used for segmentation and clustering. While static mel-frequency cepstral coefficients are the most widely used features in speech-related tasks, including speaker diarization, several studies have shown the benefits of augmenting these static features with additional speech features.

In this work, we propose and assess the use of voice-quality features (i.e., jitter, shimmer, and Glottal-to-Noise Excitation ratio) within the framework of speaker diarization. These acoustic attributes are employed together with state-of-the-art short-term cepstral and long-term prosodic features. The use of delta dynamic features is also explored, separately for the segmentation and bottom-up clustering sub-tasks. The different feature sets are combined at two levels. At the feature level, the long-term speech features are stacked in the same feature vector. At the score level, the short- and long-term speech features are modeled independently and fused at the score likelihood level.

Various feature combinations have been applied to both Gaussian mixture modeling (GMM) and i-vector-based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus. The best result, in terms of diarization error rate (DER), is obtained with i-vector-based cosine-distance clustering together with a signal parameterization that combines static cepstral coefficients, delta, voice-quality, and prosodic features. This result represents a relative DER improvement of about 24% over the baseline system, which is based on Gaussian mixture modeling and short-term static cepstral coefficients.
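The two fusion levels, the cosine-distance comparison, and the relative improvement measure mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the fusion weight `alpha`, and the assumption that both feature streams already have one vector per frame are hypothetical choices made only for illustration, since the abstract does not specify them.

```python
import math


def stack_features(stream_a, stream_b):
    """Feature-level fusion: concatenate two per-frame feature vectors.

    Assumes (hypothetically) that both streams are already aligned,
    i.e. they contain one vector per frame.
    """
    if len(stream_a) != len(stream_b):
        raise ValueError("both streams must have the same number of frames")
    return [list(a) + list(b) for a, b in zip(stream_a, stream_b)]


def fuse_scores(loglik_short, loglik_long, alpha=0.9):
    """Score-level fusion: weighted sum of two log-likelihood scores.

    The linear weighting and the default alpha are assumptions made
    for illustration; they are not taken from the abstract.
    """
    return alpha * loglik_short + (1.0 - alpha) * loglik_long


def cosine_distance(u, v):
    """Cosine distance between two vectors (e.g. two i-vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def relative_improvement(baseline_der, system_der):
    """Relative DER improvement over a baseline, in percent."""
    return 100.0 * (baseline_der - system_der) / baseline_der
```

In this sketch, feature-level fusion changes the input to a single model, whereas score-level fusion keeps one model per feature stream and combines only their outputs. Clusters whose representations have a small cosine distance would be the candidates for merging in a bottom-up clustering scheme.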
