Singing speaker clustering based on subspace learning in the GMM mean supervector space

Highlights

- Mixed-style speech causes problems when training acoustic models for speech applications such as speaker ID and ASR.
- This study is a first attempt at speaker clustering under mixed speaking styles that include reading and singing.
- Two types of subspace learning strategies in the GMM mean supervector space are studied: unsupervised and supervised.
- Advanced clustering algorithms are evaluated on a database in which each speaker both reads and sings the lyrics.
- LPP subspace learning and a proposed cluster refinement based on PLDA significantly improve clustering accuracy.

In this study, we propose algorithms based on subspace learning in the GMM mean supervector space to improve the performance of speaker clustering with speech from both reading and singing. As a speaking style, singing introduces changes in the time-frequency structure of a speaker's voice. The purpose of this study is to introduce advancements for speech systems, such as speech indexing and retrieval, that improve robustness to intrinsic variations in speech production. Speaker clustering techniques such as k-means and hierarchical clustering are explored to analyze acoustic space differences in a corpus consisting of reading and singing of lyrics for each speaker. Furthermore, a distance based on fuzzy c-means membership degrees is proposed to more accurately measure clustering difficulty, or speaker confusability. Two categories of subspace learning methods are studied: unsupervised, based on locality preserving projections (LPP), and supervised, based on probabilistic linear discriminant analysis (PLDA). Our proposed PLDA-based clustering method is a two-stage algorithm: first, initial clusters are obtained using full-dimension supervectors; next, each cluster is refined in a PLDA subspace, resulting in a more speaker-dependent representation that is less sensitive to speaking style. It is shown that LPP improves average clustering accuracy by 5.1% absolute versus a hierarchical baseline for a mixture of reading and singing, and that PLDA-based clustering increases accuracy by 9.6% absolute versus a k-means baseline. These advancements offer novel techniques to improve model formulation for speech applications including speaker ID, audio search, and audio content analysis.
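
The abstract does not give the exact form of the distance built on fuzzy c-means membership degrees, so the following is only a minimal sketch of the membership degrees themselves, using the standard fuzzy c-means update. The function name `fcm_memberships`, the fuzzifier value `m=2.0`, and the toy two-dimensional points standing in for supervectors are illustrative assumptions, not the authors' implementation.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Membership degree of one point in each cluster (standard fuzzy c-means update).

    `point` and each entry of `centers` are equal-length sequences of floats;
    `m` > 1 is the fuzzifier. All names here are hypothetical.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)); the degrees sum to 1.
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]

# Toy cluster centers standing in for two speakers (assumed data).
centers = [(0.0, 0.0), (4.0, 0.0)]
near_first = fcm_memberships((1.0, 0.0), centers)  # clearly closer to the first center
midway = fcm_memberships((2.0, 0.0), centers)      # equidistant from both centers
print(near_first, midway)
```

A point with nearly equal membership degrees across clusters, like the midway point here, is ambiguous between speakers, which is the kind of confusability that a membership-based distance would be expected to capture.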
