End-to-End Speech Recognition from Federated Acoustic Models

Training Automatic Speech Recognition (ASR) models under federated learning (FL) settings has recently attracted considerable attention. However, the FL scenarios often presented in the literature are artificial and fail to capture the complexity of real FL systems. In this paper, we construct a challenging and realistic federated experimental setup for ASR, consisting of clients with heterogeneous data distributions drawn from the French and Italian sets of the CommonVoice dataset, a large heterogeneous corpus containing thousands of different speakers, acoustic environments and noise conditions. We present the first empirical study of an attention-based sequence-to-sequence End-to-End (E2E) ASR model trained with three aggregation weighting strategies (standard FedAvg, loss-based aggregation and a novel word error rate (WER)-based aggregation), compared in two realistic FL scenarios: cross-silo with 10 clients and cross-device with 2K and 4K clients. Our analysis of E2E ASR trained from heterogeneous and realistic federated acoustic models provides a foundation for future research and development of realistic FL-based ASR applications.
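To make the three weighting strategies concrete, the sketch below shows how a server-side aggregation step could replace FedAvg's sample-count weights with inverse-loss or inverse-WER weights. This is a minimal illustration under stated assumptions: the function name, the `client_stats` keys, and the exact inverse weighting are illustrative choices, not necessarily the authors' formulation.

```python
import numpy as np

def aggregate(client_weights, client_stats, strategy="fedavg"):
    """Weighted average of client model parameters.

    client_weights: list of per-client parameter lists (one np.ndarray per layer)
    client_stats:   list of dicts with hypothetical keys
                    {"num_samples": int, "loss": float, "wer": float}
    strategy:       "fedavg" | "loss" | "wer"
    """
    if strategy == "fedavg":
        # Standard FedAvg: weight each client by its number of training samples.
        scores = np.array([s["num_samples"] for s in client_stats], dtype=float)
    elif strategy == "loss":
        # Loss-based weighting (assumption): lower local loss -> larger weight.
        scores = 1.0 / (np.array([s["loss"] for s in client_stats]) + 1e-8)
    elif strategy == "wer":
        # WER-based weighting (assumption): lower local WER -> larger weight.
        scores = 1.0 / (np.array([s["wer"] for s in client_stats]) + 1e-8)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    coeffs = scores / scores.sum()  # normalise to a convex combination

    # Average every layer tensor across clients with the chosen coefficients.
    return [
        sum(c * layers[i] for c, layers in zip(coeffs, client_weights))
        for i in range(len(client_weights[0]))
    ]
```

In all three cases the result is a convex combination of client models; only the source of the mixing coefficients changes, which is what allows the strategies to be compared under identical cross-silo and cross-device settings.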
