论文信息 - Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing

Streaming On-Device End-to-End ASR System for Privacy-Sensitive Voice-Typing

In this paper, we present our streaming on-device end-to-end speech recognition solution for a privacy sensitive voice-typing application which primarily involves typing user private details and passwords. We highlight challenges specific to voice-typing scenario in the Korean language and propose solutions to these problems within the framework of a streaming attention-based speech recognition system. Some important challenges in voicetyping are the choice of output units, coupling of multiple characters into longer byte-pair encoded units, lack of sufficient training data. Apart from customizing a high accuracy open domain streaming speech recognition model for voice-typing applications, we retain the performance of the model for open domain tasks without significant degradation. We also explore domain biasing using a shallow fusion with a weighted finite state transducer (WFST). We obtain approximately 13 % relative word error rate (WER) improvement on our internal Korean voice-typing dataset without a WFST and about 30% additional WER improvement with a WFST fusion.

[1] Yoshua Bengio,et al. End-to-end attention-based large vocabulary speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Daehyun Kim,et al. Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[3] Ankur Kumar,et al. Utterance Invariant Training for Hybrid Two-Pass End-to-End Speech Recognition , 2020, INTERSPEECH.

[4] Dongsoo Lee,et al. DeepTwist: Learning Model Compression via Occasional Weight Distortion , 2018, ArXiv.

[5] Janet M. Baker,et al. The Design for the Wall Street Journal-based CSR Corpus , 1992, HLT.

[6] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[7] Pete Warden,et al. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition , 2018, ArXiv.

[8] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .

[9] Yoshua Bengio,et al. On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.

[10] Tara N. Sainath,et al. A Comparison of Sequence-to-Sequence Models for Speech Recognition , 2017, INTERSPEECH.

[11] Colin Raffel,et al. Online and Linear-Time Attention by Enforcing Monotonic Alignments , 2017, ICML.

[12] Yonghong Yan,et al. Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition , 2019, INTERSPEECH.

[13] Tara N. Sainath,et al. Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] Ankur Kumar,et al. Utterance Confidence Measure for End-to-End Speech Recognition with Applications to Distributed Speech Recognition Scenarios , 2020, INTERSPEECH.

[15] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Bongshin Lee,et al. Voice typing: a new speech interaction model for dictation on touchscreen devices , 2012, CHI.

[17] Chanwoo Kim,et al. Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition , 2020, INTERSPEECH.

[18] Hideyuki Tachibana,et al. Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Tara N. Sainath,et al. Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home , 2017, INTERSPEECH.

[20] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Hermann Ney,et al. RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition , 2018, ACL.

[22] Dhananjaya N. Gowda,et al. Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System , 2019, INTERSPEECH.

[23] Quoc V. Le,et al. A Neural Transducer , 2015, 1511.04868.

[24] Rohit Prabhavalkar,et al. On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition , 2019, INTERSPEECH.

[25] Li Shuangfeng,et al. TensorFlow Lite: On-Device Machine Learning Framework , 2020 .

[26] Dhananjaya N. Gowda,et al. End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[27] Ankur Kumar,et al. Improved Multi-Stage Training of Online Attention-Based Encoder-Decoder Models , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[28] Sathish Reddy Indurthi,et al. Small Energy Masking for Improved Neural Network Training for End-To-End Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[30] Tara N. Sainath,et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31] Tara N. Sainath,et al. Minimum Word Error Rate Training for Attention-Based Sequence-to-Sequence Models , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32] Yu Zhang,et al. Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM , 2017, INTERSPEECH.

[33] Colin Raffel,et al. Monotonic Chunkwise Attention , 2017, ICLR.

[34] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.