Data augmentation with moment-matching networks for i-vector based speaker verification

This paper proposes an i-vector generation scheme based on conditional generative moment-matching networks (MMNs) for speaker verification. In this scheme, multiple i-vectors for each enrollment speaker are randomly generated from trained MMNs and noise distributions. The randomly generated i-vectors are assumed to represent diverse variations of each enrollment speaker. Because this paper aims to demonstrate the potential of i-vector augmentation with MMNs, we perform a preliminary i-vector-based speaker verification evaluation using support vector machines (SVMs). The SVM classification results show that the generated i-vectors contribute to estimating accurate SVM classifiers for the enrollment speakers. Based on the experimental results, we also compare the distributions of the generated i-vectors with those of the original i-vectors and discuss the differences.
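As a rough illustration of the moment-matching idea behind MMNs, the sketch below computes a biased estimate of the squared maximum mean discrepancy (MMD) between two sets of vectors; generative moment-matching networks are trained to make this quantity small between generated and real samples. It is a minimal sketch, not the paper's method: the Gaussian kernel, the bandwidth, the function names, and the random toy vectors standing in for i-vectors are all assumptions, and the conditional variant used by conditional MMNs is not shown.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between sample sets x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Toy stand-ins for i-vectors (assumption: real i-vectors would come from
# an i-vector extractor and typically have several hundred dimensions).
rng = np.random.default_rng(0)
dim = 8
original = rng.normal(0.0, 1.0, size=(200, dim))  # "original" vectors
matched = rng.normal(0.0, 1.0, size=(200, dim))   # same distribution
shifted = rng.normal(1.5, 1.0, size=(200, dim))   # mismatched distribution

bandwidth = np.sqrt(dim)  # hypothetical choice, not taken from the paper
mmd_matched = mmd2(original, matched, bandwidth)
mmd_shifted = mmd2(original, shifted, bandwidth)
print("MMD^2 (matched):", mmd_matched)
print("MMD^2 (shifted):", mmd_shifted)
```

In this toy setting, the estimate for the mismatched set comes out larger than for the matched set, which is the property a moment-matching generator exploits: samples whose distribution is close to that of the original i-vectors yield a small MMD.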
