Recent Developments on ESPnet Toolkit Boosted by Conformer

In this study, we present recent developments of ESPnet, the end-to-end speech processing toolkit, which mainly involve the recently proposed Conformer architecture, a convolution-augmented Transformer. This paper reports results for a wide range of end-to-end speech processing applications: automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with, and in some cases outperform, the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes, built on open-source and publicly available corpora, for all of the above tasks, together with pre-trained models. With this work, we aim to contribute to the research community by reducing the burden of preparing state-of-the-art research environments, which usually require substantial resources.
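To make the "convolution-augmented Transformer" idea concrete, the following is a minimal sketch of a single Conformer block in PyTorch. This is not ESPnet's actual implementation: it substitutes PyTorch's standard absolute-position `nn.MultiheadAttention` for the relative positional self-attention used in the Conformer paper, and all sizes (model dimension, kernel size, dropout) are illustrative defaults, not the paper's tuned values.

```python
# Minimal sketch of one Conformer block: FFN/2 -> self-attention ->
# convolution module -> FFN/2, each wrapped in a residual connection.
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Half-step feed-forward module: LayerNorm -> Linear -> Swish -> Linear."""
    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),  # Swish activation
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvModule(nn.Module):
    """Pointwise conv + GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv."""
    def __init__(self, d_model: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                 # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)  # -> (batch, d_model, time) for Conv1d
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = nn.functional.silu(self.batch_norm(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)          # back to (batch, time, d_model)


class ConformerBlock(nn.Module):
    """One Conformer block with half-step feed-forward residuals."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.ffn1 = FeedForward(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = FeedForward(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)           # first half-step feed-forward
        y = self.attn_norm(x)                # global context via self-attention
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.conv(x)                 # local context via convolution
        x = x + 0.5 * self.ffn2(x)           # second half-step feed-forward
        return self.final_norm(x)


if __name__ == "__main__":
    block = ConformerBlock()
    feats = torch.randn(2, 100, 256)         # (batch, frames, features)
    print(block(feats).shape)                 # torch.Size([2, 100, 256])
```

The key design point the sketch illustrates is the pairing of self-attention (global context) with a depthwise convolution module (local context) inside each block, which is what distinguishes the Conformer from a plain Transformer encoder layer.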
