Generation of F0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language

In text-to-speech (TTS) synthesis systems, the F0 contour plays an important role in conveying prosodic information, but synthesizing the F0 contour from the underlying linguistic information with a deep architecture has not yet been investigated for the Bengali language. This paper describes a method for synthesizing F0 contours of Bengali read speech from the textual features of the input text using a hybrid model that combines a Deep Boltzmann Machine (DBM) with a Twin Gaussian Process (TGP). The DBM captures the high-level linguistic structure of the input text, and its learned representation improves prediction accuracy when plugged into the TGP model. Unlike Gaussian Process (GP) models, which focus on predicting a single output (F0), the TGP can generalize across multiple outputs (F0, delta F0, delta-delta F0) by encoding relations among both inputs and outputs with GP priors. The proposed method is evaluated and compared with other available methods using objective measures and perceptual listening tests, and the results are found to be satisfactory.
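As a rough illustration of the multi-output target mentioned above, the following sketch builds per-frame vectors of (F0, delta F0, delta-delta F0) from a raw F0 sequence. The function name `delta_features`, the central-difference windows, and the edge padding are assumptions made for this example; the abstract does not state which delta windows the authors use.

```python
def delta_features(f0):
    """Return one [F0, delta F0, delta-delta F0] row per frame.

    Assumed windows (not taken from the paper):
      delta       = 0.5 * (f0[t+1] - f0[t-1])
      delta-delta = f0[t-1] - 2 * f0[t] + f0[t+1]
    """
    n = len(f0)
    if n == 0:
        return []
    # Repeat the edge frames so every frame has a left and a right neighbour.
    padded = [f0[0]] + list(f0) + [f0[-1]]
    rows = []
    for t in range(1, n + 1):
        prev, cur, nxt = padded[t - 1], padded[t], padded[t + 1]
        delta = 0.5 * (nxt - prev)
        delta2 = prev - 2.0 * cur + nxt
        rows.append([cur, delta, delta2])
    return rows


# Hypothetical four-frame F0 contour in Hz, used only to exercise the function.
contour = [100.0, 110.0, 130.0, 120.0]
targets = delta_features(contour)
```

Each row of `targets` would then serve as one structured output vector for a multi-output regressor such as the TGP, in place of the single F0 value used by a standard GP.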
