EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance without causing audible degradation in speech quality or naturalness. The EditSpeech system is built upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate contextual information related to the edited region and to achieve smooth transitions at both the left and right boundaries, so that distortion introduced to the unmodified parts of the utterance is alleviated. EditSpeech is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluations demonstrate that EditSpeech outperforms several baseline systems, yielding lower spectral distortion and preferred speech quality. Audio samples are available online for demonstration¹.
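
To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how bidirectional fusion might operate at the mel-spectrogram level. It assumes two partial-inference passes over the edited region, one decoded left-to-right conditioned on the unmodified left context and one decoded right-to-left conditioned on the unmodified right context, which are linearly crossfaded; all names here (bidirectional_fusion, mel_l2r, mel_r2l, edit_start, edit_end) are hypothetical and not taken from the paper.

    import numpy as np

    def bidirectional_fusion(mel_l2r: np.ndarray,
                             mel_r2l: np.ndarray,
                             edit_start: int,
                             edit_end: int) -> np.ndarray:
        """Blend two partial-inference passes over an edited region.

        mel_l2r: mel-spectrogram (frames x bins) decoded left-to-right,
                 conditioned on the unmodified left context.
        mel_r2l: mel-spectrogram decoded right-to-left, conditioned on
                 the unmodified right context (already time-aligned).
        edit_start, edit_end: frame indices bounding the edited region.
        """
        fused = mel_l2r.copy()
        n = edit_end - edit_start
        # Linear crossfade weights: trust the left-to-right pass near
        # the left boundary and the right-to-left pass near the right.
        w = np.linspace(1.0, 0.0, n)[:, None]
        fused[edit_start:edit_end] = (
            w * mel_l2r[edit_start:edit_end]
            + (1.0 - w) * mel_r2l[edit_start:edit_end]
        )
        return fused

Crossfading in this way keeps the frames nearest each boundary anchored to the pass that was conditioned on that boundary's unmodified context, which is one plausible reading of how smooth left and right transitions are obtained.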
