Double articulation analyzer with deep sparse autoencoder for unsupervised word discovery from speech signals

Discovering words directly from audio speech signals is a challenging problem for a developmental robot. Human infants can discover words directly from speech signals. To understand this developmental capability through a constructive approach, it is important to build a machine learning system that autonomously acquires knowledge about words and phonemes, i.e., a language model and an acoustic model, in an unsupervised manner. To this end, this paper proposes combining the nonparametric Bayesian double articulation analyzer (NPB-DAA) with the deep sparse autoencoder (DSAE). The NPB-DAA was previously proposed to achieve fully unsupervised direct word discovery from speech signals. Although it outperformed earlier unsupervised learning methods, its performance was still unsatisfactory. In this paper, we integrate the NPB-DAA with the DSAE, a neural network model that can be trained in an unsupervised manner, and evaluate the combination in an experiment on direct word discovery from speech signals. The experiment shows that the NPB-DAA with the DSAE outperforms pre-existing unsupervised learning methods and achieves state-of-the-art performance. The proposed method also outperforms several standard speech recognizer-based methods that use true word dictionaries.
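To make the idea of an unsupervised feature extractor concrete, the following is a minimal sketch of a single sparse autoencoder layer, not the DSAE used in the paper. The abstract does not specify the architecture or the sparsity term, so the tanh activation, the L1 penalty on hidden activations, the plain gradient descent update, and the function names `train_sparse_autoencoder` and `encode` are all assumptions made for illustration. A deep variant would presumably stack several such layers and feed the final hidden code to the NPB-DAA as its observation sequence.

```python
import numpy as np


def train_sparse_autoencoder(X, n_hidden, sparsity_weight=1e-3,
                             lr=0.05, epochs=300, seed=0):
    """Train one sparse autoencoder layer with full-batch gradient descent.

    Assumed setup (not taken from the paper): tanh encoder, linear decoder,
    squared reconstruction error plus an L1 penalty on hidden activations.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, size=(d, n_hidden))   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, d))   # decoder weights
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        # Forward pass: encode, then reconstruct the input.
        H = np.tanh(X @ W1 + b1)
        X_hat = H @ W2 + b2
        err = X_hat - X
        recon = 0.5 * np.mean(np.sum(err ** 2, axis=1))
        sparsity = sparsity_weight * np.mean(np.sum(np.abs(H), axis=1))
        losses.append(recon + sparsity)

        # Backward pass: gradients of the loss above.
        d_out = err / n
        dW2 = H.T @ d_out
        db2 = d_out.sum(axis=0)
        dH = d_out @ W2.T + sparsity_weight * np.sign(H) / n
        dZ = dH * (1.0 - H ** 2)                     # derivative of tanh
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)

        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1, W2, b2), losses


def encode(X, params):
    """Map input frames to their hidden (compressed) representation."""
    W1, b1, _, _ = params
    return np.tanh(X @ W1 + b1)


if __name__ == "__main__":
    # Synthetic stand-in for acoustic feature frames (hypothetical data).
    rng = np.random.default_rng(1)
    latent = rng.normal(size=(200, 3))
    frames = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(200, 8))
    frames = (frames - frames.mean(axis=0)) / frames.std(axis=0)
    params, losses = train_sparse_autoencoder(frames, n_hidden=4)
    print("initial loss:", losses[0], "final loss:", losses[-1])
    print("code shape:", encode(frames, params).shape)
```

No labels are used anywhere in training: the only learning signal is how well the layer reconstructs its own input, which is what makes this kind of model usable in a fully unsupervised pipeline.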
