A New Unsupervised Short-Utterance based Speaker Identification Approach with Parametric t-SNE Dimensionality Reduction

State-of-the-art speaker identification (SI) systems have achieved accuracies of 100%, but only with long-duration utterances, which are impractical. Short-utterance based systems have recently gained attention, although their identification rates are lower. This paper presents an approach for text-dependent, phoneme-based SI on utterances shorter than 1 s, using parametric t-distributed stochastic neighbor embedding (pt-SNE) to reduce the dimensionality of the features and provide 3D visualization. The approach employs a Gaussian mixture model (GMM) enhanced by the k-means++ and gap statistic methods. As there is no other similar work, a fair comparison is unavailable. The 75% identification rate achieved is comparable to that of other works that use either (i) short utterances or (ii) pt-SNE for recognition of other data types.
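
To make the roles of k-means++ and the gap statistic concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that k-means++ supplies the cluster seeds and that the gap statistic selects the number of clusters, which could then serve as the number of Gaussian mixture components. The function names (`kmeanspp_seeds`, `kmeans_dispersion`, `choose_k_by_gap`, `make_blobs`), the uniform bounding-box reference distribution, and the synthetic 2D data are illustrative assumptions. The mixture model fitting and the pt-SNE stage are not shown.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest chosen centre."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(sq_dist(p, c) for c in centres) for p in points]
        if sum(weights) == 0:  # every point already coincides with a centre
            centres.append(rng.choice(points))
        else:
            centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


def kmeans_dispersion(points, k, rng, restarts=3, max_iter=50):
    """Run Lloyd's algorithm from k-means++ seeds and return the lowest
    within-cluster sum of squared distances W_k over the restarts."""
    best = math.inf
    for _ in range(restarts):
        centres = kmeanspp_seeds(points, k, rng)
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
                clusters[nearest].append(p)
            # Empty clusters keep their previous centre.
            updated = [
                tuple(sum(col) / len(c) for col in zip(*c)) if c else centres[i]
                for i, c in enumerate(clusters)
            ]
            if updated == centres:
                break
            centres = updated
        w = sum(min(sq_dist(p, c) for c in centres) for p in points)
        best = min(best, w)
    return best


def choose_k_by_gap(points, k_max, n_refs=10, seed=0):
    """Gap statistic: compare log W_k on the data with its mean over uniform
    reference sets, then return the smallest k with
    Gap(k) >= Gap(k+1) - s_(k+1). Assumes W_k > 0 for every k tried."""
    rng = random.Random(seed)
    lo = [min(col) for col in zip(*points)]
    hi = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Assumed reference: uniform samples over the data's bounding box.
            ref = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in points]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


def make_blobs(centres, n_per_blob, sd, seed=0):
    """Synthetic stand-in for feature vectors: isotropic Gaussian blobs."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(m, sd) for m in c) for c in centres for _ in range(n_per_blob)]


if __name__ == "__main__":
    data = make_blobs([(0, 0), (10, 0), (0, 10)], n_per_blob=30, sd=0.5, seed=1)
    k, gaps = choose_k_by_gap(data, k_max=5)
    print("selected number of clusters:", k)
```

In a fuller pipeline, the selected k and the resulting cluster centres would plausibly initialize the mixture components before expectation-maximization, and the input points would be the pt-SNE embeddings of the acoustic features rather than synthetic blobs.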
