Paper Information - Weakly Supervised Multi-Embeddings Learning of Acoustic Models

We trained a Siamese network with multi-task same/different information on a speech dataset, and found that a single network could be shared across both tasks without a loss in performance. The first task was to decide whether two speech tokens were the same word or different words, and the second was to decide whether they were spoken by the same talker or by different talkers.

Emmanuel Dupoux | Gabriel Synnaeve
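
The shared-network idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class `SharedSiamese`, the layer sizes, the single shared hidden layer, and the cosine-based same/different loss are all hypothetical choices. The sketch only shows the forward pass and the combined loss; it does not train anything.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def same_different_loss(u, v, same):
    """Assumed loss: pull 'same' pairs toward cos = 1, push 'different' pairs toward cos = 0."""
    c = cosine(u, v)
    return (1.0 - c) / 2.0 if same else c * c


def linear(weights, x):
    """Matrix-vector product, with weights stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, xi) for xi in x]


def init_weights(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SharedSiamese:
    """Hypothetical network: one shared trunk feeding two embedding heads."""

    def __init__(self, n_in, n_hidden, n_embed, seed=0):
        rng = random.Random(seed)
        self.trunk = init_weights(n_hidden, n_in, rng)           # shared by both tasks
        self.word_head = init_weights(n_embed, n_hidden, rng)    # word embedding
        self.talker_head = init_weights(n_embed, n_hidden, rng)  # talker embedding

    def embed(self, x):
        h = relu(linear(self.trunk, x))
        return linear(self.word_head, h), linear(self.talker_head, h)


def multitask_loss(net, x_a, x_b, same_word, same_talker):
    """Sum of the word and talker same/different losses for one input pair."""
    word_a, talker_a = net.embed(x_a)
    word_b, talker_b = net.embed(x_b)
    return (same_different_loss(word_a, word_b, same_word)
            + same_different_loss(talker_a, talker_b, same_talker))


# Usage: one pair labelled "same word, different talker" (made-up feature vectors).
net = SharedSiamese(n_in=4, n_hidden=8, n_embed=3)
pair_loss = multitask_loss(net, [0.2, 0.1, 0.7, 0.3], [0.3, 0.0, 0.6, 0.4],
                           same_word=True, same_talker=False)
```

Because both inputs pass through the same weights, this is a Siamese setup, and because the two heads share the trunk, the only supervision needed is the pair of same/different labels for each input pair.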
