A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling

We report on an architecture for the unsupervised discovery of talker-invariant subword embeddings. It consists of two components: a dynamic time warping (DTW)-based spoken term discovery (STD) system and a Siamese deep neural network (DNN). The STD system clusters word-sized repeated fragments in the acoustic stream, while the DNN is trained to minimize the distance between time-aligned frames of tokens from the same cluster and to maximize the distance between tokens from different clusters. We use additional side information: the average duration of phonemic units and talker identity tags. For evaluation, we use the datasets and metrics of the Zero Resource Speech Challenge. The model improves over the baseline in subword unit modeling.
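The training signal described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so a generic margin-based contrastive loss is swapped in, and the names `dtw_path`, `embed`, `siamese_loss`, and the `margin` parameter are hypothetical. The single tanh layer stands in for the DNN, and the random arrays stand in for acoustic features of two tokens that the STD system placed in the same cluster.

```python
import numpy as np

def dtw_path(a, b):
    """Align two feature sequences a (n, d) and b (m, d) with plain DTW.

    Returns the warping path as a list of (i, j) frame-index pairs.
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between all frames of a and b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end of both sequences to their start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1]

def embed(x, weights):
    """Toy shared ("Siamese") encoder: one tanh layer (an assumption)."""
    return np.tanh(x @ weights)

def siamese_loss(emb_a, emb_b, same, margin=1.0):
    """Generic contrastive loss over paired frame embeddings (an assumption).

    Same-cluster pairs are pulled together; different-cluster pairs are
    pushed apart until they are at least `margin` away from each other.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=-1)
    if same:
        return float(np.mean(d ** 2))
    return float(np.mean(np.maximum(0.0, margin - d) ** 2))

# Usage: two synthetic "tokens" assumed to come from the same cluster.
rng = np.random.default_rng(0)
token_a = rng.normal(size=(5, 4))   # 5 frames, 4-dimensional features
token_b = rng.normal(size=(7, 4))   # 7 frames, 4-dimensional features
weights = rng.normal(size=(4, 3))   # shared encoder weights

path = dtw_path(token_a, token_b)
idx_a = [i for i, _ in path]
idx_b = [j for _, j in path]
loss = siamese_loss(
    embed(token_a[idx_a], weights), embed(token_b[idx_b], weights), same=True
)
```

Both inputs pass through the same weights, which is what makes the network Siamese. In a real system the loss would be minimized with a gradient-based optimizer, and different-cluster pairs would need their own frame pairing, which this sketch leaves out.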
