A New Unsupervised Short-Utterance based Speaker Identification Approach with Parametric t-SNE Dimensionality Reduction

State-of-the-art speaker identification (SI) systems have achieved accuracies of 100% with long-duration utterances, which are impractical in many applications. Short-utterance based systems have recently gained attention, although their identification rates are lower. This paper presents an approach for text-dependent, phoneme-based (<1 s) SI that uses parametric t-distributed stochastic neighbor embedding (pt-SNE) to reduce the dimensionality of the features and provide a 3D visualization. The approach employs a Gaussian mixture model (GMM) enhanced by the $k$-means++ and gap statistic methods. As there is no other similar work, a fair comparison is not possible. The achieved identification rate of 75% is comparable to that of other works using (i) short utterances or (ii) pt-SNE for the recognition of other data types.
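
As an illustration only, the following NumPy sketch shows how $k$-means++ seeding and the gap statistic can be combined to choose the number of clusters $k$ in a low-dimensional (for example, 3D pt-SNE) feature space. It assumes the features are the rows of an array `X`; the function names, the uniform bounding-box reference distribution, the number of reference sets and the synthetic data are our own assumptions rather than details taken from the paper, and neither the pt-SNE mapping nor the GMM fitting is shown.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance to the nearest centre so far."""
    n = X.shape[0]
    centres = [X[rng.integers(n)]]
    for _ in range(1, k):
        diffs = X[:, None, :] - np.asarray(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1).min(axis=1)
        total = d2.sum()
        # Fall back to uniform sampling if every point coincides with a centre.
        p = d2 / total if total > 0 else None
        centres.append(X[rng.choice(n, p=p)])
    return np.asarray(centres)


def assign(X, C):
    """Index of the nearest centroid for every row of X."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)


def kmeans(X, k, rng, n_init=5, max_iter=100):
    """Lloyd's algorithm with k-means++ seeding; keeps the run with the
    smallest within-cluster sum of squares W_k. Returns (W_k, centroids, labels)."""
    best = None
    for _ in range(n_init):
        C = kmeans_pp_init(X, k, rng)
        for _ in range(max_iter):
            labels = assign(X, C)
            # Empty clusters keep their previous centroid.
            new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                              for j in range(k)])
            converged = np.allclose(new_C, C)
            C = new_C
            if converged:
                break
        labels = assign(X, C)
        W = ((X - C[labels]) ** 2).sum()
        if best is None or W < best[0]:
            best = (W, C, labels)
    return best


def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    """Gap statistic with a uniform reference over the bounding box of X
    (an assumed choice of reference distribution). Returns the smallest k
    with Gap(k) >= Gap(k+1) - s_{k+1}, together with all Gap values."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans(X, k, rng)[0])
        ref_log_w = np.array([np.log(kmeans(rng.uniform(lo, hi, size=X.shape), k, rng)[0])
                              for _ in range(n_refs)])
        gaps.append(ref_log_w.mean() - log_w)
        s.append(ref_log_w.std() * np.sqrt(1 + 1 / n_refs))
    for k in range(1, k_max):
        # gaps[k - 1] is Gap(k); gaps[k] and s[k] belong to k + 1.
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps


# Hypothetical usage: three well-separated 3D blobs standing in for
# pt-SNE embeddings of one speaker's feature frames.
rng = np.random.default_rng(42)
true_centres = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [20.0, 20.0, 20.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((100, 3)) for c in true_centres])
k_hat, gaps = gap_statistic(X, k_max=6, n_refs=10, seed=0)
W, centroids, labels = kmeans(X, k_hat, rng)
print(k_hat)
print(np.round(centroids, 1))
```

The chosen $k$ and the seeded centroids could then initialise a GMM; scikit-learn's `sklearn.mixture.GaussianMixture`, for instance, accepts `n_components` and an initial `means_init` array. How the paper couples the clusters to its speaker models is not specified in this abstract.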

Roselina Arelhi | Omar Elnaggar
