Recent Advances in End-to-End Automatic Speech Recognition

The speech community has recently seen a significant shift from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve state-of-the-art ASR accuracy on most benchmarks, hybrid models are still used in a large proportion of commercial ASR systems. Many practical factors affect the decision to deploy a model in production. Traditional hybrid models, having been optimized for production for decades, usually handle these factors well. Unless E2E models offer excellent solutions to all of these factors, it will be hard for them to be widely commercialized. In this paper, we give an overview of recent advances in E2E models, focusing on technologies that address those challenges from the industry's perspective.

[1]  George Sterpu,et al.  Learning to Count Words in Fluent Speech enables Online Speech Recognition , 2020, ArXiv.

[2]  Bhuvana Ramabhadran,et al.  Multilingual Speech Recognition with Self-Attention Structured Parameterization , 2020, INTERSPEECH.

[3]  Tara N. Sainath,et al.  Emitting Word Timings with End-to-End Models , 2020, INTERSPEECH.

[4]  Hairong Liu,et al.  Exploring neural transducers for end-to-end speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[5]  Jonathan Le Roux,et al.  An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Han Lu,et al.  End-To-End Multi-Talker Overlapping Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Takuya Yoshioka,et al.  Advances in Online Audio-Visual Meeting Transcription , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[8]  Fang Deng,et al.  End-to-End Code-Switching ASR for Low-Resourced Language Pairs , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[9]  Georg Heigold,et al.  Multilingual acoustic models using distributed deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Mohan Li,et al.  End-to-end Speech Recognition with Adaptive Computation Steps , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Gil Keren,et al.  Alignment Restricted Streaming Recurrent Neural Network Transducer , 2021, 2021 IEEE Spoken Language Technology Workshop (SLT).

[12]  Daehyun Kim,et al.  Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[13]  Lei Xie,et al.  Cascade RNN-Transducer: Syllable Based Streaming On-Device Mandarin Speech Recognition with a Syllable-To-Character Converter , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[14]  Yu Zhang,et al.  Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM , 2017, INTERSPEECH.

[15]  Cyril Allauzen,et al.  Hybrid Autoregressive Transducer (HAT) , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Daehyun Kim,et al.  Iterative Compression of End-to-End ASR Model using AutoML , 2020, INTERSPEECH.

[17]  Jinyu Li,et al.  Improved training for online end-to-end speech recognition systems , 2017, INTERSPEECH.

[18]  Horia Cucu,et al.  An Evaluation of Word-Level Confidence Estimation for End-to-End Automatic Speech Recognition , 2021, 2021 IEEE Spoken Language Technology Workshop (SLT).

[19]  Yoshua Bengio,et al.  End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[20]  Xiong Xiao,et al.  Developing Far-Field Speaker System Via Teacher-Student Learning , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Tara N. Sainath,et al.  Recognizing Long-Form Speech Using Streaming End-to-End Models , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[22]  Geoffrey Zweig,et al.  Advances in all-neural speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Maja Pantic,et al.  End-to-End Audiovisual Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Naoyuki Kanda,et al.  On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer , 2020, Interspeech.

[25]  Brian Kingsbury,et al.  Advancing RNN Transducer Technology for Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[27]  Yanmin Qian,et al.  Exploring Model Units and Training Strategies for End-to-End Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[28]  Naoyuki Kanda,et al.  Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition , 2021, Interspeech.

[29]  Tatsuya Kawahara,et al.  Transfer Learning of Language-independent End-to-end ASR with Language Model Fusion , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30]  Navdeep Jaitly,et al.  Sequence-to-Sequence Models Can Directly Translate Foreign Speech , 2017, INTERSPEECH.

[31]  Florian Metze,et al.  Towards Context-Aware End-to-End Code-Switching Speech Recognition , 2020, INTERSPEECH.

[32]  Satoshi Nakamura,et al.  Listening while speaking: Speech chain by deep learning , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[33]  Furu Wei,et al.  UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset , 2021, 2107.05233.

[34]  George Saon,et al.  Alignment-Length Synchronous Decoding for RNN Transducer , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Qian Zhang,et al.  Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition , 2020, ArXiv.

[36]  Tara N. Sainath,et al.  Multitask Training with Text Data for End-to-End Speech Recognition , 2020, Interspeech.

[37]  Yu Zhang,et al.  Highway long short-term memory RNNS for distant speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[38]  Wei Chu,et al.  Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units , 2018, 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP).

[39]  Brian Kingsbury,et al.  Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40]  Philip C. Woodland,et al.  Integrating Source-Channel and Attention-Based Sequence-to-Sequence Models for Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[41]  Tara N. Sainath,et al.  A Comparison of Sequence-to-Sequence Models for Speech Recognition , 2017, INTERSPEECH.

[42]  Jonathan Le Roux,et al.  Streaming Automatic Speech Recognition with the Transformer Model , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43]  Jonathan Le Roux,et al.  Transformer-Based Long-Context End-to-End Speech Recognition , 2020, INTERSPEECH.

[44]  Shinji Watanabe,et al.  Auxiliary Feature Based Adaptation of End-to-end ASR Systems , 2018, INTERSPEECH.

[45]  Kartik Audhkhasi,et al.  Guiding CTC Posterior Spike Timings for Improved Posterior Fusion and Knowledge Distillation , 2019, INTERSPEECH.

[46]  Tara N. Sainath,et al.  Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling , 2020, ICLR.

[47]  Brian Kingsbury,et al.  4-bit Quantization of LSTM-based Speech Recognition Models , 2021, Interspeech.

[48]  Titouan Parcollet,et al.  SpeechBrain: A General-Purpose Speech Toolkit , 2021, ArXiv.

[49]  Yajie Miao,et al.  EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[50]  Florian Metze,et al.  Subword and Crossword Units for CTC Acoustic Models , 2017, INTERSPEECH.

[51]  Jinyu Li,et al.  Improving Multilingual Transformer Transducer Models by Reducing Language Confusions , 2021, Interspeech.

[52]  Hermann Ney,et al.  Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[53]  Naoyuki Kanda,et al.  Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[54]  Shinji Watanabe,et al.  Gaussian Kernelized Self-Attention for Long Sequence Data and its Application to CTC-Based Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[55]  Liang Lu,et al.  Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition , 2017, INTERSPEECH.

[56]  Hieu Duy Nguyen,et al.  Quantization Aware Training with Absolute-Cosine Regularization for Automatic Speech Recognition , 2020, INTERSPEECH.

[57]  Mike Schuster,et al.  Japanese and Korean voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[58]  Chao Weng,et al.  Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[59]  Alexei Baevski,et al.  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[60]  Tara N. Sainath,et al.  State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[61]  Hermann Ney,et al.  Improved training of end-to-end attention models for speech recognition , 2018, INTERSPEECH.

[62]  Shinji Watanabe,et al.  Joint CTC-attention based end-to-end speech recognition using multi-task learning , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[63]  Athanasios Mouchtaris,et al.  CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition , 2021, Interspeech.

[64]  Yifan Gong,et al.  An Overview of Noise-Robust Automatic Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[65]  Hao Tang,et al.  An Unsupervised Autoregressive Model for Speech Representation Learning , 2019, INTERSPEECH.

[66]  Ivan Medennikov,et al.  Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription , 2020, INTERSPEECH.

[67]  Nicolas Usunier,et al.  End-to-End Speech Recognition From the Raw Waveform , 2018, INTERSPEECH.

[68]  Yifan Gong,et al.  Advancing Connectionist Temporal Classification with Attention Modeling , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[69]  Geoffrey Zweig,et al.  Benchmarking LF-MMI, CTC And RNN-T Criteria For Streaming ASR , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[70]  Yulan Liu,et al.  Streaming Multi-Speaker ASR with RNN-T , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[71]  Tara N. Sainath,et al.  Multilingual Speech Recognition with a Single End-to-End Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[72]  Liang Lu,et al.  Deep beamforming networks for multi-channel speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[73]  Yifan Gong,et al.  Improving RNN Transducer Modeling for End-to-End Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[74]  Hagen Soltau,et al.  Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction , 2021, Interspeech.

[75]  Shinji Watanabe,et al.  End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[76]  Gil Keren,et al.  Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion , 2021, Interspeech 2021.

[77]  Matt Shannon,et al.  Optimizing Expected Word Error Rate via Sampling for Speech Recognition , 2017, INTERSPEECH.

[78]  Khe Chai Sim,et al.  Efficient Implementation of Recurrent Neural Network Transducer in Tensorflow , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[79]  Khe Chai Sim,et al.  An Investigation Into On-device Personalization of End-to-end Automatic Speech Recognition Models , 2019, INTERSPEECH.

[80]  Hagen Soltau,et al.  Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition , 2016, INTERSPEECH.

[81]  Tara N. Sainath,et al.  A Better and Faster end-to-end Model for Streaming ASR , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[82]  Tatsuya Kawahara,et al.  Distilling the Knowledge of BERT for Sequence-to-Sequence ASR , 2020, INTERSPEECH.

[83]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[84]  Linhao Dong,et al.  CIF: Continuous Integrate-And-Fire for End-To-End Speech Recognition , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[85]  Athanasios Mouchtaris,et al.  Phonetically Induced Subwords for End-to-End Speech Recognition , 2021, Interspeech 2021.

[86]  Yongqiang Wang,et al.  Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR , 2019, INTERSPEECH.

[87]  Sheng Zhao,et al.  A Light-weight contextual spelling correction model for customizing transducer-based speech recognition systems , 2021, Interspeech.

[88]  Tara N. Sainath,et al.  Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model , 2019, INTERSPEECH.

[89]  Maja Pantic,et al.  Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[90]  Jonathan Le Roux,et al.  Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[91]  Tara N. Sainath,et al.  A Streaming On-Device End-To-End Model Surpassing Server-Side Conventional Model Quality and Latency , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[92]  Qian Zhang,et al.  Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[93]  Tara N. Sainath,et al.  Shallow-Fusion End-to-End Contextual Biasing , 2019, INTERSPEECH.

[94]  Tara N. Sainath,et al.  BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition , 2021, ArXiv.

[95]  Parisa Haghani,et al.  Leveraging Language ID in Multilingual End-to-End Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[96]  Andreas Stolcke,et al.  Joint ASR and Language Identification Using RNN-T: An Efficient Approach to Dynamic Language Switching , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[97]  Bhuvana Ramabhadran,et al.  End-to-end speech recognition and keyword search on low-resource languages , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[98]  John R. Hershey,et al.  Joint CTC/attention decoding for end-to-end speech recognition , 2017, ACL.

[99]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[100]  Carlos Busso,et al.  End-to-End Audiovisual Speech Recognition System With Multitask Learning , 2021, IEEE Transactions on Multimedia.

[101]  Florian Metze,et al.  Sequence-Based Multi-Lingual Low Resource Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[102]  Gil Keren,et al.  Deep Shallow Fusion for RNN-T Personalization , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[103]  Richard Socher,et al.  Improved Regularization Techniques for End-to-End Speech Recognition , 2017, ArXiv.

[104]  Chao Weng,et al.  Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition , 2021, Interspeech 2021.

[105]  Richard Socher,et al.  An Investigation of Phone-Based Subword Units for End-to-End Speech Recognition , 2020, INTERSPEECH.

[106]  Ariya Rastrow,et al.  Amortized Neural Networks for Low-Latency Speech Recognition , 2021, Interspeech.

[107]  Takaaki Hori,et al.  Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers , 2021, Interspeech.

[108]  Bhuvana Ramabhadran,et al.  Mixture of Informed Experts for Multilingual Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[109]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[110]  Andreas Schwarz,et al.  Improving RNN-T ASR Accuracy Using Context Audio , 2020, Interspeech.

[111]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[112]  Rich Caruana,et al.  Model compression , 2006, KDD '06.

[113]  Zhiheng Huang,et al.  Self-attention Networks for Connectionist Temporal Classification in Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[114]  Brian Kingsbury,et al.  Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio , 2021, Interspeech.

[115]  Colin Raffel,et al.  Monotonic Chunkwise Attention , 2017, ICLR.

[116]  Daniel S. Park,et al.  Efficient Knowledge Distillation for RNN-Transducer Models , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[117]  Liang Lu,et al.  On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[118]  Tetsunori Kobayashi,et al.  Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict , 2020, INTERSPEECH.

[119]  Hagen Soltau,et al.  Joint Speech Recognition and Speaker Diarization via Sequence Transduction , 2019, INTERSPEECH.

[120]  Joon Son Chung,et al.  Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[121]  Jonathan Le Roux,et al.  MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[122]  Shinji Watanabe,et al.  End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming , 2020, INTERSPEECH.

[123]  Nanyun Peng,et al.  Espresso: A Fast End-to-End Neural Speech Recognition Toolkit , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[124]  Exploring End-to-End Multi-channel ASR with Bias Information for Meeting Transcription , 2020, ArXiv.

[125]  Jonathan Le Roux,et al.  Streaming End-to-End Speech Recognition with Joint CTC-Attention Based Models , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[126]  Tara N. Sainath,et al.  Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus , 2020, INTERSPEECH.

[127]  Liangliang Cao,et al.  Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[128]  George Saon,et al.  Knowledge Distillation from Offline to Streaming RNN Transducer for End-to-End Speech Recognition , 2020, INTERSPEECH.

[129]  Naoyuki Kanda,et al.  Streaming Multi-talker Speech Recognition with Joint Speaker Identification , 2021, Interspeech.

[130]  Dushyant Sharma,et al.  Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition , 2021, Interspeech 2021.

[131]  Hermann Ney,et al.  CTC in the Context of Generalized Full-Sum HMM Training , 2017, INTERSPEECH.

[132]  Roland Maas,et al.  Streaming End-to-End Bilingual ASR Systems with Joint Language Identification , 2020, ArXiv.

[133]  Xiao Chen,et al.  Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition , 2020, INTERSPEECH.

[134]  Jonathan Le Roux,et al.  End-To-End Multi-Speaker Speech Recognition With Transformer , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[135]  Xiaofei Wang,et al.  Serialized Output Training for End-to-End Overlapped Speech Recognition , 2020, INTERSPEECH.

[136]  Yonghong Yan,et al.  Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[137]  Jinyu Li,et al.  A Configurable Multilingual Model is All You Need to Recognize All Languages , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[138]  Yifan Gong,et al.  Towards Code-switching ASR for End-to-end CTC Models , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[139]  Chengyi Wang,et al.  Low Latency End-to-End Streaming Speech Recognition with a Scout Network , 2020, INTERSPEECH.

[140]  Steve Renals,et al.  Adaptation Algorithms for Speech Recognition: An Overview , 2020, ArXiv.

[141]  J. Tao,et al.  Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition , 2020, INTERSPEECH.

[142]  Srikanth Ronanki,et al.  Transformer-Transducers for Code-Switched Speech Recognition , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[143]  Hermann Ney,et al.  Returnn: The RWTH extensible training framework for universal recurrent neural networks , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[144]  Ruslan Salakhutdinov,et al.  Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training? , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[145]  Hung-yi Lee,et al.  Meta Learning for End-To-End Low-Resource Speech Recognition , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[146]  Tara N. Sainath,et al.  No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[147]  Suyoun Kim,et al.  Towards Language-Universal End-to-End Speech Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[148]  Patrick Nguyen,et al.  Model Unit Exploration for Sequence-to-Sequence Speech Recognition , 2019, ArXiv.

[149]  Yifan Gong,et al.  Learning small-size DNN with output-distribution-based criteria , 2014, INTERSPEECH.

[150]  Shinji Watanabe,et al.  Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios , 2021, Interspeech.

[151]  Hagen Soltau,et al.  Monotonic Recurrent Neural Network Transducer and Decoding Strategies , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[152]  Titouan Parcollet,et al.  E2E-SINCNET: Toward Fully End-To-End Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[153]  Jiangyan Yi,et al.  Synchronous Transformers for end-to-end Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[154]  Frank Zhang,et al.  Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition , 2020, ArXiv.

[155]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[156]  Puming Zhan,et al.  Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems , 2021, Interspeech 2021.

[157]  Khe Chai Sim,et al.  Robust Continuous On-Device Personalization for Automatic Speech Recognition , 2021, Interspeech.

[158]  Shinji Watanabe,et al.  Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration , 2019, INTERSPEECH.

[159]  Yifan Gong,et al.  On Addressing Practical Challenges for RNN-Transducer , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[160]  Steve Renals,et al.  Learning Noise Invariant Features Through Transfer Learning For Robust End-to-End Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[161]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[162]  Yifan Gong,et al.  Acoustic-to-word model without OOV , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[163]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[164]  Tara N. Sainath,et al.  An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling , 2021, Interspeech.

[165]  Yashesh Gaur,et al.  Listen, Look and Deliberate: Visual Context-Aware Speech Recognition Using Pre-Trained Text-Video Representations , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[166]  Shinji Watanabe,et al.  Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[167]  Shinji Watanabe,et al.  Recent Developments on Espnet Toolkit Boosted By Conformer , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[168]  Puming Zhan,et al.  Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR , 2019, INTERSPEECH.

[169]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[170]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[171]  I-Fan Chen,et al.  Maximum a posteriori adaptation of network parameters in deep models , 2015, INTERSPEECH.

[172]  Naomi Harte,et al.  Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition , 2018, ICMI.

[173]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[174]  Shinji Watanabe,et al.  Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[175]  Tara N. Sainath,et al.  Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[176]  H. H. Mao,et al.  Speech Recognition and Multi-Speaker Diarization of Long Conversations , 2020, INTERSPEECH.

[177]  Shiliang Zhang,et al.  Investigation of Transformer Based Spelling Correction Model for CTC-Based End-to-End Mandarin Speech Recognition , 2019, INTERSPEECH.

[178]  Kaisheng Yao,et al.  KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[179]  Naoyuki Kanda,et al.  Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[180]  Jinyu Li,et al.  Streaming End-to-End Multi-Talker Speech Recognition , 2020, IEEE Signal Processing Letters.

[181]  Giovanni Motta,et al.  Personalization of End-to-End Speech Recognition on Mobile Devices for Named Entities , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[182]  Jonathan Le Roux,et al.  End-to-End Multi-Speaker Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[183]  Naoyuki Kanda,et al.  Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[184]  Yanmin Qian,et al.  Knowledge Distillation for End-to-End Monaural Multi-Talker ASR System , 2019, INTERSPEECH.

[185]  Daniel Willett,et al.  Using Synthetic Audio to Improve The Recognition of Out-Of-Vocabulary Words in End-To-End ASR Systems , 2020, ArXiv.

[186]  Alexei Baevski,et al.  vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations , 2019, ICLR.

[187]  Wei Li,et al.  Monotonic Infinite Lookback Attention for Simultaneous Machine Translation , 2019, ACL.

[188]  Nicolas Usunier,et al.  Fully Convolutional Speech Recognition , 2018, ArXiv.

[189]  Hao Li,et al.  Bi-Encoder Transformer Network for Mandarin-English Code-Switching Speech Recognition Using Mixture of Experts , 2020, INTERSPEECH.

[190]  Yashesh Gaur,et al.  On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition , 2020, INTERSPEECH.

[191]  John R. Hershey,et al.  Language independent end-to-end architecture for joint language identification and speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[192]  Wei Chu,et al.  CASS-NAT: CTC Alignment-based Single Step Non-autoregressive Transformer for Speech Recognition , 2020, ArXiv.

[193]  Tetsuji Ogawa,et al.  Improved Mask-CTC for Non-Autoregressive End-to-End ASR , 2020, ArXiv.

[194]  Shinji Watanabe,et al.  End-to-end Monaural Multi-speaker ASR System without Pretraining , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[195]  Jinyu Li,et al.  Factorized Neural Transducer for Efficient Language Model Adaptation , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[196]  Tara N. Sainath,et al.  An Attention-Based Joint Acoustic and Text on-Device End-To-End Model , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[197]  Xiangang Li,et al.  Semantic Data Augmentation for End-to-End Mandarin Speech Recognition , 2021, Interspeech.

[198]  Colin Raffel,et al.  Online and Linear-Time Attention by Enforcing Monotonic Alignments , 2017, ICML.

[199]  Yifan Gong,et al.  Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need , 2021, Interspeech.

[200]  Tara N. Sainath,et al.  Compression of End-to-End Models , 2018, INTERSPEECH.

[201]  Gabriel Synnaeve,et al.  Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters , 2020, INTERSPEECH.

[202]  Tara N. Sainath,et al.  Semi-supervised Training for End-to-end Models via Weak Distillation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[203]  Naoyuki Kanda,et al.  Maximum a posteriori Based Decoding for CTC Acoustic Models , 2016, INTERSPEECH.

[204]  Quoc V. Le,et al.  Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition , 2020, ArXiv.

[205]  Tara N. Sainath,et al.  Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[206]  James R. Glass,et al.  Combining End-to-End and Adversarial Training for Low-Resource Speech Recognition , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[207]  Hermann Ney,et al.  A Comparison of Transformer and LSTM Encoder Decoder Models for ASR , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[208]  Zhuo Chen,et al.  Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[209]  Xiaofeng Liu,et al.  RNN-Transducer with Stateless Prediction Network , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[210]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[211]  Shinji Watanabe,et al.  ESPnet: End-to-End Speech Processing Toolkit , 2018, INTERSPEECH.

[212]  Furu Wei,et al.  UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data , 2021, ICML.

[213]  Ho-Gyeong Kim,et al.  Knowledge Distillation Using Output Errors for Self-attention End-to-end Models , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[214]  Ronan Collobert,et al.  wav2vec: Unsupervised Pre-training for Speech Recognition , 2019, INTERSPEECH.

[215]  Naoyuki Kanda,et al.  Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition , 2021, ArXiv.

[216]  John R. Hershey,et al.  Hybrid CTC/Attention Architecture for End-to-End Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.

[217]  Tara N. Sainath,et al.  FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[218]  Maja Pantic,et al.  End-To-End Audio-Visual Speech Recognition with Conformers , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[219]  Vikas Chandra,et al.  Collaborative Training of Acoustic Encoders for Speech Recognition , 2021, Interspeech.

[220]  Tara N. Sainath,et al.  Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[221]  Atsushi Kojima.  Knowledge Distillation for Streaming Transformer-Transducer , 2021, Interspeech.

[222]  Kjell Schubert,et al.  Transformer-Transducer: End-to-End Speech Recognition with Self-Attention , 2019, ArXiv.

[223]  Shinji Watanabe,et al.  Transformer ASR with Contextual Block Processing , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[224]  Davis Liang,et al.  Learning Noise-Invariant Representations for Robust Speech Recognition , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[225]  Ramón Fernández Astudillo,et al.  Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text , 2019, INTERSPEECH.

[226]  Tara N. Sainath,et al.  A Spelling Correction Model for End-to-end Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[227]  Tara N. Sainath,et al.  Cascaded Encoders for Unifying Streaming and Non-Streaming ASR , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[228]  Tara N. Sainath,et al.  Transformer Based Deliberation for Two-Pass Speech Recognition , 2021, 2021 IEEE Spoken Language Technology Workshop (SLT).

[229]  Yifan Gong,et al.  Have best of both worlds: two-pass hybrid and E2E cascading framework for speech recognition , 2021, ArXiv.

[230]  Yashesh Gaur,et al.  Continuous Streaming Multi-Talker ASR with Dual-Path Transducers , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[231]  Athanasios Mouchtaris,et al.  Multi-Channel Transformer Transducer for Speech Recognition , 2021, Interspeech.

[232]  Kevin Duh,et al.  Multilingual End-to-End Speech Translation , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[233]  Chengzhu Yu,et al.  Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition , 2019, INTERSPEECH.

[234]  Reinhold Häb-Umbach,et al.  Neural network based spectral mask estimation for acoustic beamforming , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[235]  Tara N. Sainath,et al.  A Deliberation-Based Joint Acoustic and Text Decoder , 2021, Interspeech.

[236]  Ding Zhao,et al.  Dynamic Sparsity Neural Networks for Automatic Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[237]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[238]  Tara N. Sainath,et al.  Scaling End-to-End Models for Large-Scale Multilingual ASR , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[239]  Tara N. Sainath,et al.  A Comparison of End-to-End Models for Long-Form Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[240]  Ryo Masumura,et al.  Distilling Attention Weights for CTC-Based ASR Systems , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[241]  Dong Yu,et al.  Recognizing Multi-talker Speech with Permutation Invariant Training , 2017, INTERSPEECH.

[242]  Rohit Prabhavalkar,et al.  Dissecting User-Perceived Latency of On-Device E2E Speech Recognition , 2021, Interspeech.

[243]  Julian Chan,et al.  Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency , 2021, Interspeech.

[244]  Razvan Pascanu,et al.  Overcoming catastrophic forgetting in neural networks , 2016, Proceedings of the National Academy of Sciences.

[245]  Tara N. Sainath,et al.  A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[246]  Quoc V. Le,et al.  SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[247]  Yifan Gong,et al.  Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator , 2020, INTERSPEECH.

[248]  Adam Coates,et al.  Cold Fusion: Training Seq2Seq Models Together with Language Models , 2017, INTERSPEECH.

[249]  Wei Chen,et al.  Modality Attention for End-to-end Audio-visual Speech Recognition , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[250]  Tara N. Sainath,et al.  Learning Word-Level Confidence for Subword End-To-End ASR , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[251]  Jun Wang,et al.  Improving Attention-Based End-to-End ASR Systems with Sequence-Based Loss Functions , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[252]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.

[253]  Yashesh Gaur,et al.  Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[254]  Tara N. Sainath,et al.  Deep Context: End-to-end Contextual Speech Recognition , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[255]  Yifan Gong,et al.  Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[256]  Athanasios Mouchtaris,et al.  End-to-End Multi-Channel Transformer for Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[257]  Yashesh Gaur,et al.  Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[258]  Shuai Zhang,et al.  RNN-Transducer with Language Bias for End-to-End Mandarin-English Code-Switching Speech Recognition , 2020, 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP).

[259]  Andreas Stolcke,et al.  Efficient minimum word error rate training of RNN-Transducer for end-to-end speech recognition , 2020, INTERSPEECH.

[260]  Tara N. Sainath,et al.  Improving Performance of End-to-End ASR on Numeric Sequences , 2019, INTERSPEECH.

[261]  Samarth Bharadwaj,et al.  Multilingual and code-switching ASR challenges for low resource Indian languages , 2021, Interspeech.

[262]  Kshitiz Kumar,et al.  Multi-Dialect Speech Recognition in English Using Attention on Ensemble of Experts , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[263]  Yashesh Gaur,et al.  Speaker Adaptation for Attention-Based End-to-End Speech Recognition , 2019, INTERSPEECH.

[264]  Yu Zhang,et al.  Conformer: Convolution-augmented Transformer for Speech Recognition , 2020, INTERSPEECH.

[265]  Yu-An Chung,et al.  Generative Pre-Training for Speech with Autoregressive Predictive Coding , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[266]  Gil Keren,et al.  Contextual RNN-T For Open Domain ASR , 2020, INTERSPEECH.

[267]  Jesper Jensen,et al.  Permutation invariant training of deep models for speaker-independent multi-talker speech separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[268]  Haizhou Li,et al.  Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition , 2020, INTERSPEECH.

[269]  Mohan Li,et al.  Transformer-Based Online Speech Recognition with Decoder-end Adaptive Computation Steps , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[270]  Tatsuya Kawahara,et al.  Acoustic-to-Word Attention-Based Model Complemented with Character-Level CTC-Based Model , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[271]  Shinji Watanabe,et al.  Non-Autoregressive Transformer for Speech Recognition , 2021, IEEE Signal Processing Letters.

[272]  Hao Tang,et al.  End-to-End Neural Segmental Models for Speech Recognition , 2017, IEEE Journal of Selected Topics in Signal Processing.

[273]  Gunnar Evermann,et al.  Class LM and word mapping for contextual biasing in End-to-End ASR , 2020, INTERSPEECH.

[274]  Steve Renals,et al.  Multilingual training of deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[275]  Shinji Watanabe,et al.  Streaming Transformer ASR with Blockwise Synchronous Beam Search , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[276]  Hung-yi Lee,et al.  Towards Lifelong Learning of End-to-end ASR , 2021, Interspeech.

[277]  Hung-yi Lee,et al.  Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation , 2021, FINDINGS.

[278]  Vikas Joshi,et al.  Transfer Learning Approaches for Streaming End-to-End Speech Recognition System , 2020, INTERSPEECH.

[279]  Shuang Xu,et al.  Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[280]  John R. Hershey,et al.  Multichannel End-to-end Speech Recognition , 2017, ICML.

[281]  Chng Eng Siong,et al.  Speech Transformer with Speaker Aware Persistent Memory , 2020, INTERSPEECH.

[282]  Jiangyan Yi,et al.  Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition , 2019, INTERSPEECH.

[283]  Maurizio Omologo,et al.  Speech Recognition with Microphone Arrays , 2001, Microphone Arrays.

[284]  Stefan Riezler,et al.  On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR , 2021, Interspeech.

[285]  Srinivasan Umesh,et al.  Investigation of Methods to Improve the Recognition Performance of Tamil-English Code-Switched Data in Transformer Framework , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[286]  Yifan Gong,et al.  Speaker Adaptation for End-to-End CTC Models , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[287]  Matteo Negri,et al.  Adapting Transformer to End-to-End Spoken Language Translation , 2019, INTERSPEECH.

[288]  Matt Shannon,et al.  Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping , 2017, INTERSPEECH.

[289]  Ozlem Kalinli,et al.  Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios , 2021, Interspeech.

[290]  Tara N. Sainath,et al.  Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[291]  Yoshua Bengio,et al.  End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[292]  Preethi Jyothi,et al.  An Investigation of End-to-End Models for Robust Speech Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[293]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[294]  Ehsan Variani,et al.  A Density Ratio Approach to Language Model Fusion in End-to-End Automatic Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[295]  Bhiksha Raj,et al.  Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors , 2012, IEEE Signal Processing Magazine.

[296]  Tatsuya Kawahara,et al.  Enhancing Monotonic Multihead Attention for Streaming ASR , 2020, INTERSPEECH.

[297]  Kyu J. Han,et al.  Multi-mode Transformer Transducer with Stochastic Future Context , 2021, Interspeech.

[298]  Bin Ma,et al.  Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data , 2019, INTERSPEECH.

[299]  Hasim Sak,et al.  Reducing Streaming ASR Model Delay with Self Alignment , 2021, Interspeech.

[300]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[301]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[302]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process.

[303]  Yashesh Gaur,et al.  Combination of End-to-End and Hybrid Models for Speech Recognition , 2020, INTERSPEECH.

[304]  Tara N. Sainath,et al.  Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling , 2019, ArXiv.

[305]  Lin-Shan Lee,et al.  Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[306]  Yashesh Gaur,et al.  Acoustic-to-Phrase Models for Speech Recognition , 2019, INTERSPEECH.

[307]  Nikko Strom,et al.  Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[308]  Hermann Ney,et al.  Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition , 2021, Interspeech.

[309]  Zhong Meng,et al.  Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability , 2020, INTERSPEECH.

[310]  Tara N. Sainath,et al.  Phoebe: Pronunciation-aware Contextualization for End-to-end Speech Recognition , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[311]  Lei Xie,et al.  WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit , 2021, Interspeech.

[312]  Janne Pylkkönen,et al.  Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network , 2021, Interspeech.

[313]  Xiaofei Wang,et al.  A Comparative Study on Transformer vs RNN in Speech Applications , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[314]  Tara N. Sainath,et al.  Less is More: Improved RNN-T Decoding Using Limited Label Context and Path Merging , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[315]  Naoyuki Kanda,et al.  Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers , 2020, INTERSPEECH.

[316]  Alexander H. Waibel,et al.  Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[317]  Lei Xie,et al.  Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition , 2020, INTERSPEECH.