Data Augmentation with Locally-Time Reversed Speech for Automatic Speech Recognition

Psychoacoustic studies have shown that locally time-reversed (LTR) speech, i.e., speech in which the signal samples are time-reversed within short segments, can be accurately recognised by human listeners. This study addresses the question of how well a state-of-the-art automatic speech recognition (ASR) system performs on LTR speech. The underlying objective is to explore the feasibility of deploying LTR speech in the training of end-to-end (E2E) ASR models, as a data augmentation technique for improving recognition performance. The investigation starts with experiments to understand the effect of LTR speech on general-purpose ASR. LTR speech with reversed-segment durations from 5 ms to 50 ms is rendered and evaluated. For ASR training data augmentation with LTR speech, training sets are created by combining natural speech with different partitions of LTR speech. The efficacy of data augmentation is confirmed by ASR results on speech corpora covering various languages and speaking styles. ASR on LTR speech with reversed-segment durations of 15 ms to 30 ms is found to give lower error rates than other segment durations. Data augmentation with such LTR speech achieves satisfactory and consistent improvements in ASR performance.
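To make the LTR rendering step concrete, below is a minimal Python/NumPy sketch of local time reversal as described in the abstract: the waveform is partitioned into consecutive fixed-length segments and the sample order is flipped within each segment, leaving the segment order intact. The function name, signature, and the 20 ms example setting are illustrative assumptions, not details taken from the paper.

import numpy as np

def locally_time_reverse(waveform: np.ndarray, sample_rate: int,
                         segment_ms: float) -> np.ndarray:
    # Number of samples per reversed segment; at least one sample.
    segment_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = waveform.copy()
    # Flip the sample order inside each consecutive segment; the order
    # of the segments themselves is preserved, so coarse temporal
    # structure survives while fine temporal structure is reversed.
    for start in range(0, len(waveform), segment_len):
        out[start:start + segment_len] = waveform[start:start + segment_len][::-1]
    return out

# Example: a 1 s synthetic tone at 16 kHz, reversed in 20 ms segments
# (20 ms lies inside the 15 ms to 30 ms range the abstract reports as
# yielding the lowest ASR error rates).
sr = 16000
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
x_ltr = locally_time_reverse(x, sr, segment_ms=20.0)

In an augmentation pipeline, such LTR copies would simply be mixed with the natural-speech training utterances, which is the combination strategy the abstract describes.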
