Effective Speaker Retrieval and Recognition through Vector Quantization and Unsupervised Distance Learning

The huge amount of multimedia content accumulated daily has demanded the development of effective retrieval approaches. In this context, speaker recognition methods capable of automatically identifying a person through their voice are of great relevance. This paper presents a novel speaker recognition approach, modelled as a retrieval task and built on a recent unsupervised learning method. The proposed approach uses MFCC features and a Vector Quantization model to compute distances among audio objects. A rank-based unsupervised learning method is then applied to improve the effectiveness of the retrieval results. Experiments were conducted on three public datasets under different settings, including background noise from diverse sources. The results demonstrate that the proposed approach achieves very high retrieval effectiveness, with gains of up to +27% obtained by the unsupervised learning procedure.
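To make the pipeline concrete, the sketch below illustrates the general MFCC + Vector Quantization retrieval scheme described above: MFCC frames are extracted per audio object, a per-object codebook is trained by k-means, pairwise distances are computed as average quantization distortion, and a rank-based step refines the distances. This is a minimal illustration, not the authors' implementation: it assumes librosa and scikit-learn for feature extraction and clustering, the parameter values are placeholders, and the final re-ranking function is a simple neighbor-overlap stand-in for the paper's rank-correlation-based unsupervised learning method.

```python
# Minimal sketch of a VQ-based speaker retrieval pipeline (illustrative, not the authors' code).
# Assumes librosa and scikit-learn; audio paths and parameter values are hypothetical.
import numpy as np
import librosa
from sklearn.cluster import KMeans


def mfcc_frames(path, n_mfcc=13, sr=16000):
    """Load an audio file and return its MFCC frames, one row per frame."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T


def vq_codebook(frames, n_codewords=64, seed=0):
    """Train a VQ codebook (k-means centroids) on the MFCC frames of one audio object."""
    km = KMeans(n_clusters=n_codewords, n_init=10, random_state=seed).fit(frames)
    return km.cluster_centers_


def vq_distortion(query_frames, codebook):
    """Average quantization distortion of query frames against a reference codebook."""
    # Euclidean distance of each frame to every codeword, keep the nearest, average over frames.
    d = np.linalg.norm(query_frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()


def distance_matrix(paths, n_codewords=64):
    """Symmetrized VQ distance between every pair of audio objects."""
    feats = [mfcc_frames(p) for p in paths]
    books = [vq_codebook(f, n_codewords) for f in feats]
    n = len(paths)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                D[i, j] = vq_distortion(feats[i], books[j])
    return (D + D.T) / 2.0


def rerank_by_neighbor_overlap(D, k=10):
    """Rank-based refinement (a simplified stand-in for the paper's unsupervised method):
    pairs whose top-k ranked lists overlap heavily have their distance reduced."""
    n = D.shape[0]
    topk = [set(np.argsort(D[i])[:k]) for i in range(n)]
    overlap = np.array([[len(topk[i] & topk[j]) / k for j in range(n)] for i in range(n)])
    return D * (1.0 - 0.5 * overlap)
```

As a usage example, calling `rerank_by_neighbor_overlap(distance_matrix(paths))` on a list of audio file paths yields a refined distance matrix whose rows can be sorted to produce the ranked lists used for retrieval and, via a nearest-neighbor rule, for speaker identification.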
