Recent Progresses On Deep Learning For Speech Recognition

We discuss two important areas in deep-learning-based automatic speech recognition (ASR) that have recently received significant research attention: end-to-end (E2E) modeling and robust ASR. E2E modeling aims to simplify the modeling pipeline and reduce the dependency on domain knowledge by introducing sequence-to-sequence translation models. These models usually optimize the ASR objectives end-to-end with few assumptions, and can potentially improve ASR performance when abundant training data is available. Robustness is critical to practical ASR systems, yet it still falls short of what is desired. Many new approaches, such as teacher-student learning, adversarial training, and improved speech separation and enhancement, have been proposed to improve the robustness of these systems. We summarize the recent progress in these two areas with a focus on the successful technologies proposed and the insights behind them. We also discuss possible research directions.