The NII speech synthesis entry for Blizzard Challenge 2016

This paper describes the NII speech synthesis entry for Blizzard Challenge 2016, where the task was to build a voice from audiobook data. The synthesis system is built on the NII parametric speech synthesis framework, which uses a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) for acoustic modeling. For this entry, we first built a voice using a large data set, and then used the audiobook data to adapt the acoustic model to the target speaker. Additionally, the recent full-band glottal vocoder GlottDNN was used in the system, with a DNN-based excitation model for generating glottal waveforms. The vocoder estimates the vocal tract in a band-wise manner, using Quasi Closed Phase (QCP) inverse filtering in the low band. At the synthesis stage, the excitation model generates the voiced excitation from acoustic features, after which a vocal tract filter is applied to generate synthetic speech. The Blizzard Challenge listening test results show that the proposed system achieves quality comparable to that of the benchmark parametric synthesis systems.

Index Terms: Blizzard Challenge, parametric speech synthesis, speaker adaptation, glottal vocoding, LSTM
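
The synthesis stage described in the abstract follows the source-filter principle: an excitation signal is passed through a vocal tract filter. The following is a minimal sketch of that principle only, not the GlottDNN implementation. It assumes an impulse train as a stand-in for the DNN-generated glottal excitation and a one-pole all-pole filter as a stand-in for the estimated vocal tract filter; the function names `pulse_train` and `all_pole_filter` and all parameter values are hypothetical.

```python
def pulse_train(f0_hz, sample_rate, n_samples):
    """Toy voiced excitation: one unit impulse per pitch period.

    Assumption: stands in for the DNN-generated glottal waveform.
    """
    period = int(round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, a):
    """Filter the excitation with 1 / A(z), where A(z) = a[0] + a[1] z^-1 + ...

    Assumption: a[0] == 1.0, as in standard linear-prediction notation.
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        # Subtract the weighted past outputs (the recursive part of the filter).
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


sample_rate = 16000                      # Hz, illustrative value
excitation = pulse_train(100.0, sample_rate, 800)
vocal_tract = [1.0, -0.9]                # toy one-pole filter, not a real vocal tract estimate
speech = all_pole_filter(excitation, vocal_tract)
```

In a real vocoder the filter coefficients would change frame by frame and the excitation would carry the glottal pulse shape; the sketch keeps both fixed to show only the signal flow.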
