Decoupling Recognition and Transcription in Mandarin ASR

Much of the recent literature on automatic speech recognition (ASR) takes an end-to-end approach. Unlike English, where the writing system is closely tied to sound, Chinese characters (Hanzi) primarily encode meaning rather than pronunciation. We propose factoring the audio → Hanzi task into two sub-tasks: (1) audio → Pinyin and (2) Pinyin → Hanzi, where Pinyin is the standard phonetic transcription system for Mandarin Chinese. Factoring the task in this way achieves a 3.9% character error rate (CER) on the AISHELL-1 corpus, the best result reported on this dataset to date.
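
To make the factoring concrete, the Python sketch below chains an audio → Pinyin recognizer with a Pinyin → Hanzi converter and shows how the reported CER metric is computed as character-level edit distance divided by reference length. The function names and interfaces are hypothetical stand-ins for the two trained sub-models, not the paper's actual implementation.

# Minimal sketch of the factored pipeline, assuming two already-trained
# sub-models exposed as callables (hypothetical names, not the authors' API).
from typing import Callable, List, Sequence

def factored_decode(
    audio_features: Sequence[float],
    audio_to_pinyin: Callable[[Sequence[float]], List[str]],
    pinyin_to_hanzi: Callable[[List[str]], str],
) -> str:
    """Decode audio -> Pinyin -> Hanzi instead of audio -> Hanzi directly."""
    pinyin = audio_to_pinyin(audio_features)   # e.g. ["ni3", "hao3"]
    return pinyin_to_hanzi(pinyin)             # e.g. "你好"

def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein distance over characters / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # Single-row dynamic programme for edit distance.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,          # drop a reference character
                      row[j - 1] + 1,      # insert a hypothesis character
                      prev + (r != h))     # substitute (or match)
            prev, row[j] = row[j], cur
    return row[-1] / max(len(ref), 1)

For example, with reference 今天天气 and hypothesis 今天天汽, the edit distance is 1, giving a CER of 1/4 = 25%; the 3.9% figure above is this quantity computed over the AISHELL-1 test set.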
