Unsupervised Tibetan Speech Feature Learning Based on Dynamic Bayesian Networks

This paper proposes an unsupervised method for learning speech features based on Dynamic Bayesian Networks (DBNs) that accounts for the spatiotemporal dependencies in the speech signal. Although deep networks have been successfully applied to unsupervised feature learning, their structures are often fixed before learning, and they fail to capture temporal representations. We therefore construct DBNs to learn spatial-temporal features from speech data without supervision. Experimental results on Tibetan speech data show that the features learned with the proposed DBNs outperform state-of-the-art methods in word recognition accuracy.
