Adaptive spectral smoothening for development of robust keyword spotting system

It is well known that the performance of a keyword spotting (KWS) system degrades significantly when training and test conditions are mismatched. This work proposes an approach for reducing the mismatch between training and test speech caused by speaker-related variability and environmental noise. In the proposed approach, variational mode decomposition is first performed on each short-term magnitude spectrum to decompose it adaptively into a number of variational mode functions (VMFs). A sufficiently smoothed spectrum is then reconstructed by selecting only two lower-frequency VMFs. When the KWS system is developed using Mel-frequency cepstral coefficients (MFCCs) extracted from the smoothed spectra, significantly improved performance is observed under pitch- and noise-mismatched test conditions. To further suppress the mismatch due to the pitch and speaking rate of the speakers, data-augmented training based on explicit prosody modification is performed. The experimental results presented in this study show that data-augmented training further enhances the performance of the developed KWS system.
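The spectral smoothing step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a simplified variational mode decomposition without the Lagrange-multiplier update, and the function names (`vmd`, `smooth_spectrum`), the number of modes, the bandwidth penalty `alpha`, and the synthetic test spectrum are all assumptions chosen for illustration. The magnitude spectrum of one frame is treated as a one-dimensional signal along the frequency-bin axis, decomposed into modes, and rebuilt from the two modes with the lowest centre frequencies.

```python
import numpy as np


def vmd(signal, num_modes=3, alpha=2000.0, num_iters=300, tol=1e-8):
    """Simplified VMD sketch (no Lagrange-multiplier update, i.e. tau = 0).

    Returns (modes, centre_freqs), sorted by ascending centre frequency.
    Centre frequencies are in cycles per sample. All defaults are
    illustrative assumptions, not values taken from the paper.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    half = n // 2
    # Mirror both ends of the signal to reduce boundary artefacts.
    ext = np.concatenate([x[:half][::-1], x, x[n - half:][::-1]])
    t = ext.size
    freqs = np.fft.fftfreq(t)
    pos = freqs >= 0
    # Keep only the non-negative-frequency half of the spectrum.
    f_hat = np.where(pos, np.fft.fft(ext), 0.0)
    u_hat = np.zeros((num_modes, t), dtype=complex)
    # Initial centre frequencies spread uniformly over [0, 0.5).
    omega = 0.5 * np.arange(num_modes) / num_modes
    for _ in range(num_iters):
        u_prev = u_hat.copy()
        for k in range(num_modes):
            # Wiener-filter the residual left by the other modes around omega[k].
            residual = f_hat - u_hat.sum(axis=0) + u_hat[k]
            u_hat[k] = residual / (1.0 + 2.0 * alpha * (freqs - omega[k]) ** 2)
            # Move the centre frequency to the mode's spectral centre of gravity.
            power = np.abs(u_hat[k, pos]) ** 2
            omega[k] = np.sum(freqs[pos] * power) / (np.sum(power) + 1e-12)
        change = np.sum(np.abs(u_hat - u_prev) ** 2)
        if change / (np.sum(np.abs(u_prev) ** 2) + 1e-12) < tol:
            break
    # Restore Hermitian symmetry so that each mode is real-valued.
    full = u_hat.copy()
    full[:, 1:] += np.conj(u_hat[:, 1:][:, ::-1])
    modes = np.fft.ifft(full, axis=1).real[:, half:half + n]
    order = np.argsort(omega)
    return modes[order], omega[order]


def smooth_spectrum(magnitude, keep=2, **vmd_kwargs):
    """Rebuild a spectrum from the `keep` modes with the lowest centre frequencies."""
    modes, _ = vmd(magnitude, **vmd_kwargs)
    return modes[:keep].sum(axis=0)


# Synthetic stand-in for one frame's magnitude spectrum (an assumption):
# a smooth envelope plus a slow and a fast ripple along the bin axis.
bins = np.arange(256)
envelope = 0.6 + np.exp(-((bins - 60) / 40.0) ** 2)
slow_ripple = 0.2 * np.cos(2 * np.pi * 0.2 * bins)
fast_ripple = 0.2 * np.cos(2 * np.pi * 0.4 * bins)
spectrum = envelope + slow_ripple + fast_ripple

smoothed = smooth_spectrum(spectrum, keep=2, num_modes=3)
```

In a full front end, `smooth_spectrum` would be applied to every frame before the Mel filterbank and cepstral analysis. The smoothed values may need flooring at a small positive constant before the logarithm is taken, which is a further assumption not stated in the source.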