Speech recognition system for a service robot - a performance evaluation

In this work we adapt and evaluate several solutions for automatic speech recognition (ASR) to serve as a human-machine interface (HMI) for an assistant robot. Two on-device systems, Kaldi (DNN-HMM) and Mozilla's DeepSpeech (end-to-end), and three cloud service APIs, IBM Watson, Microsoft Azure, and Google Speech-to-Text, are evaluated. The systems are adapted to the domain of robot commands and evaluated on a set of expected inputs. As the goal is to retain the ability to recognise general language, the systems are also evaluated on out-of-domain data.
