Fcl-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech Synthesis

Sequence-to-sequence (seq2seq) learning has greatly improved text-to-speech (TTS) synthesis performance, but effective deployment on resource-restricted devices remains challenging because seq2seq models are usually computationally expensive and memory intensive. To achieve fast inference speed and a small model size while maintaining high-quality speech, we propose FCL-taco2, a Fast, Controllable and Lightweight (FCL) TTS model based on Tacotron2. FCL-taco2 adopts a novel semi-autoregressive (SAR) mode for phoneme-level parallel mel-spectrogram generation conditioned on prosody features, leading to faster inference and higher prosody controllability than Tacotron2. In addition, knowledge distillation (KD) is leveraged to compress a relatively large FCL-taco2 model into its small version with minor loss of speech quality. Experimental results on English (EN) and Chinese (CN) datasets show that the small version of FCL-taco2 achieves performance comparable to Tacotron2 in terms of speech quality, while having a 4.8× smaller footprint and 17.7× and 18.5× faster inference speeds on average for the EN and CN experiments respectively. Furthermore, execution on mobile devices shows that the proposed model can achieve faster-than-real-time speech synthesis. Our code and audio samples are publicly released.
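The knowledge-distillation step mentioned above trains the small student model to match both the ground-truth mel-spectrograms and the outputs of the larger teacher. The abstract does not specify the exact distillation objective, so the following is only a minimal generic sketch of teacher-student distillation for a spectrogram-regression model; the L1 losses, the `alpha` weighting, and all tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_mel, teacher_mel, target_mel, alpha=0.5):
    """Blend supervision from the ground truth and a frozen teacher.

    All tensors have shape (batch, frames, n_mels). `alpha` weights the
    teacher-matching term (a hypothetical hyperparameter for this sketch).
    """
    hard = F.l1_loss(student_mel, target_mel)            # fit the data
    soft = F.l1_loss(student_mel, teacher_mel.detach())  # mimic the teacher
    return (1 - alpha) * hard + alpha * soft

# Toy usage: zero student predictions against all-ones teacher/target.
student = torch.zeros(2, 10, 80, requires_grad=True)
teacher = torch.ones(2, 10, 80)
target = torch.ones(2, 10, 80)
loss = distillation_loss(student, teacher, target)
loss.backward()  # gradients flow only into the student tensor
```

Detaching the teacher output keeps its parameters frozen, so only the student is updated, which is the standard arrangement in distillation-based model compression.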
