A Temporal Coherence Loss Function for Learning Unsupervised Acoustic Embeddings

Abstract: We train neural networks of varying depth with a loss function that encourages the output representations to have a temporal profile resembling that of phonemes. We show that a simple loss function, which maximizes the contrast between the similarity of nearby frames and the similarity of temporally distant frames, helps construct a speech embedding that improves phoneme discriminability, both within and across speakers, even though the loss function only uses within-speaker information. However, with too deep an architecture, this loss function leads to overfitting, suggesting the need for more data and/or regularization.
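The abstract describes the loss only at a high level: nearby frames should be more similar than temporally distant ones. The exact formulation is not given here, so the following is a minimal sketch under assumptions: a cosine-similarity hinge loss over one utterance, with illustrative (not the paper's) offsets `near`, `far` and `margin` values.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def temporal_coherence_loss(frames, near=1, far=15, margin=0.5):
    """Hinge loss pushing nearby frames to be more similar than distant ones.

    frames : (T, D) array of embeddings for one utterance.
    near, far : frame offsets defining the "close" and "distant" pairs.
    margin : how much more similar a near pair must be than a far pair.

    NOTE: this is an illustrative reconstruction, not the paper's exact loss.
    """
    T = frames.shape[0]
    n = T - far  # number of anchor frames with both a near and a far partner
    anchors = frames[:n]
    sim_near = cosine_sim(anchors, frames[near:near + n])
    sim_far = cosine_sim(anchors, frames[far:far + n])
    # Penalize whenever a distant frame is not at least `margin` less
    # similar to the anchor than the nearby frame is.
    return np.mean(np.maximum(0.0, margin - sim_near + sim_far))
```

On a constant sequence (near and far pairs equally similar) the loss equals the margin, while on a slowly rotating embedding trajectory, where distant frames become orthogonal, the hinge is inactive and the loss is zero.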
