On integrating insights from human speech perception into automatic speech recognition

Despite the effort and progress of the last few decades, the performance of automatic speech recognition (ASR) systems still lags far behind that of humans. Some researchers believe that more speech data will be sufficient to bridge this performance gap. Others argue that the current methods require radical modifications, and that inspiration for these modifications should come from human speech perception (HSP). This paper focuses on two issues: first, it presents a comparison between HSP and ASR, emphasizing insights from HSP that could still be applied in ASR; second, it presents ideas for extracting useful non-linguistic information from the speech signal, the so-called ‘rich transcription’, which could help in selecting specialized acoustic-linguistic models that offer higher accuracy than general-purpose models.