A domain-mismatch speech recognition system in radio communication based on improved spectrum augmentation

Most current automatic speech recognition (ASR) systems require the test data and the training data to come from similar domains in order to achieve satisfactory accuracy. When the two differ in style, for example in background noise, sampling rate, or speaking rate, recognition accuracy degrades. In some areas such as radio communications, acoustic models trained on clean, noise-free Mandarin speech data, which is relatively easy to obtain, still do not perform as well as they do in other areas because of this difference in data style. We therefore investigate an audio-level speech augmentation method that directly processes the raw signal. The augmentation policy consists of masking blocks of frequency channels, both randomly and specifically, using the Discrete Fourier Transform (DFT). We apply our method to a radio Mandarin speech recognition task and achieve relative improvements of 24% in Character Error Rate (CER) and 22% in Word Error Rate (WER) over other augmentation policies. In addition, we propose a method for eliminating the impact of the original sampling rate, and the experimental results verify its effectiveness.
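To make the idea of audio-level frequency masking concrete, the following is a minimal sketch of the random variant only: it takes the DFT of a raw waveform, zeroes one randomly placed block of frequency bins, and transforms back to the time domain. The function name `mask_frequency_block`, the parameter `max_width_hz`, and its default value are assumptions made for illustration and are not taken from the paper; the targeted ("specific") masking and the sampling-rate handling are not shown.

```python
import numpy as np


def mask_frequency_block(signal, sample_rate, max_width_hz=500.0, rng=None):
    """Zero one random contiguous block of DFT bins in a real 1-D waveform.

    Illustrative sketch only: the name, parameters, and default width are
    assumptions, not the augmentation policy evaluated in the paper.
    Returns the masked waveform and the (start_bin, width) of the mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)

    # One-sided DFT of the real-valued signal.
    spectrum = np.fft.rfft(signal)
    n_bins = spectrum.shape[0]

    # Convert the maximum mask width from Hz to a number of DFT bins.
    hz_per_bin = sample_rate / len(signal)
    max_width = max(1, int(max_width_hz / hz_per_bin))

    # Draw the mask width and its starting bin uniformly at random.
    width = min(int(rng.integers(1, max_width + 1)), n_bins)
    start = int(rng.integers(0, n_bins - width + 1))

    # Mask the block and return to the time domain at the original length.
    spectrum[start:start + width] = 0
    masked = np.fft.irfft(spectrum, n=len(signal))
    return masked, (start, width)


# Usage: mask a synthetic one-second, 8 kHz signal made of two tones.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
augmented, (start_bin, mask_width) = mask_frequency_block(
    wave, sample_rate, rng=np.random.default_rng(0)
)
```

Because the output is again a waveform of the same length, an augmentation of this kind can be applied before any feature extraction, which is what distinguishes it from masking applied to spectrogram features.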