Bayes risk CTC: Controllable CTC alignment in Sequence-to-Sequence tasks

Sequence-to-sequence (seq2seq) tasks transcribe an input sequence into a target sequence. The Connectionist Temporal Classification (CTC) criterion is widely used in such tasks. Besides predicting the target sequence, CTC also predicts an alignment as a by-product: the most probable input-length sequence, which specifies a hard alignment between input and target units. Because the CTC formulation treats all potential aligning sequences (called paths) equally, which path becomes most probable, and hence the predicted alignment, is left uncertain. In addition, the alignment predicted by vanilla CTC is commonly observed to drift from its reference, and it rarely offers practical functionality. This work is therefore motivated by making the CTC alignment prediction controllable, which in turn equips CTC with extra functionality. We propose the Bayes risk CTC (BRCTC) criterion, in which a customizable Bayes risk function enforces the desired characteristics of the predicted alignment. With the risk function, BRCTC is a general framework for expressing a customizable preference over paths, so that the posterior concentrates on a particular subset of them. In applications, we explore one preference that yields models with down-sampling ability and reduced inference cost. With another preference, for early emissions, we obtain an improved performance-latency trade-off for online models. Experimentally, BRCTC reduces the inference cost of offline models by up to 47% without performance degradation and cuts the overall latency of online systems to a previously unseen level.
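
To make the notion of paths and a path-level preference concrete, the toy sketch below enumerates every CTC path for a tiny example by brute force and compares the plain CTC likelihood (all paths weighted equally) with a risk-weighted sum. It is only an illustration under stated assumptions: the abstract does not give the exact BRCTC objective, so the exponential early-emission weight, the parameter `lam`, and all function names here are hypothetical, and practical implementations would use dynamic programming rather than enumeration.

```python
import itertools
import math

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def paths_for(target, num_frames, vocab_size):
    """Brute-force list of all input-length paths that collapse to `target`."""
    return [p for p in itertools.product(range(vocab_size), repeat=num_frames)
            if collapse(p) == tuple(target)]


def path_prob(path, probs):
    """Probability of one path under frame-wise independent posteriors."""
    return math.prod(probs[t][s] for t, s in enumerate(path))


def last_emission(path):
    """Frame index of the last non-blank symbol (needs a non-empty target)."""
    return max(t for t, s in enumerate(path) if s != BLANK)


def weighted_likelihood(target, probs, risk=lambda path: 1.0):
    """Sum of risk(path) * P(path); risk == 1 gives the plain CTC likelihood."""
    num_frames, vocab_size = len(probs), len(probs[0])
    return sum(risk(p) * path_prob(p, probs)
               for p in paths_for(target, num_frames, vocab_size))


def early_emission_risk(lam):
    """Hypothetical preference: down-weight paths that finish emitting late."""
    return lambda path: math.exp(-lam * last_emission(path))


if __name__ == "__main__":
    # 3 frames, vocabulary {blank, token 1}, uniform posteriors.
    probs = [[0.5, 0.5]] * 3
    target = (1,)
    print("paths:", paths_for(target, 3, 2))
    print("plain CTC likelihood:", weighted_likelihood(target, probs))
    print("early-emission weighted:",
          weighted_likelihood(target, probs, early_emission_risk(1.0)))
```

In this toy setting, maximizing the weighted sum rewards posterior mass on paths that finish emitting early, which is one way to read the idea of concentrating the posterior on a preferred subset of paths.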
