A framework for speaker retrieval and identification through unsupervised learning

Abstract: Speaker recognition is a highly relevant task with applications in diverse domains. Recently, mainly because audio-visual content has become so easy to acquire, the ability to analyze growing datasets without relying on labeled data has become a crucial advantage. This paper presents a speaker recognition approach based on recent unsupervised learning methods, which require neither labeled data nor user intervention. The approach is organized as a framework that exploits a rank-based formulation. The similarity information defined by speaker modeling techniques is encoded in ranked lists, which serve as input to the unsupervised learning algorithms. Vector quantization, Gaussian mixture models, and i-vectors are employed as modeling techniques, while the RL-Sim and ReckNN algorithms are used for the unsupervised learning tasks. The framework was experimentally evaluated on query-by-example speaker retrieval and speaker identification tasks, on both clean and noisy speech recordings, using three public datasets that cover different languages and recording conditions. Effectiveness gains of up to 56% on retrieval measures were obtained by applying the unsupervised learning algorithms on top of traditional speaker recognition techniques.
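
The rank-based formulation described in the abstract can be illustrated with a minimal sketch. It is not the RL-Sim or ReckNN algorithm: it is a simplified stand-in that only shows the general idea of encoding similarities as ranked lists and then re-ranking by comparing the top-k neighborhoods of those lists, without any labels. The function names, the toy 2-D points standing in for speaker models, and the choice of k are all assumptions made for illustration.

```python
import numpy as np


def ranked_lists(dist):
    """For each item, return all item indices sorted by increasing distance."""
    return np.argsort(dist, axis=1, kind="stable")


def topk_overlap_distance(ranks, k):
    """Illustrative distance: 1 - |N_k(i) & N_k(j)| / k, where N_k(i) is the
    set of the k items at the top of item i's ranked list."""
    n = ranks.shape[0]
    member = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    member[rows, ranks[:, :k].ravel()] = 1  # member[i, j] = 1 if j is in N_k(i)
    overlap = member @ member.T             # sizes of pairwise set intersections
    return 1.0 - overlap / k


def rerank(dist, k):
    """One unsupervised re-ranking pass (hypothetical helper).

    Items are sorted by the overlap-based distance; ties are broken by the
    original distance so that the initial ordering is not lost.
    """
    ranks = ranked_lists(dist)
    new_dist = topk_overlap_distance(ranks, k)
    # np.lexsort uses the LAST key as the primary sort key.
    return np.array(
        [np.lexsort((dist[i], new_dist[i])) for i in range(dist.shape[0])]
    )


# Toy data (assumption): two well-separated groups of 2-D points, each point
# standing in for one recording's speaker model.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]],
    dtype=float,
)
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
reranked = rerank(dist, k=4)
print(reranked[:, :4])  # top-4 retrieved items for each query
```

In a real pipeline the distance matrix would come from comparing speaker models (for example VQ codebooks, GMMs, or i-vectors) rather than from toy points, and the re-ranking step would be replaced by the actual unsupervised learning algorithm.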
