Analyzing the performance of ASR systems: The effects of noise, distance to the device, age and gender

In a Natural Language Interaction (NLI) solution, the efficiency of the Automatic Speech Recognition (ASR) component is a key issue. With this in mind, the paper analyzes the performance of three ASR systems in several noise scenarios that resemble interaction with a TV in a domestic environment. The evaluation setup used input devices commonly employed for voice interaction with a TV/Set-top Box: a remote control with a microphone and two far-field microphones placed at different distances from the user. The analysis focused on cloud-based ASR systems (Google, Bing, and Nuance) that can be used in NLI approaches for Interactive Television in European Portuguese (EP), and investigated the possible influence of noise, distance, gender, and age on their performance. The results showed that Google is the most robust system, followed by Bing and Nuance. ASR performance tends to deteriorate with background noise and/or as the distance between the user and the input device increases. Age significantly affects the performance of Bing and Nuance, but not that of Google. All three ASR systems proved robust to gender variation. This work aimed at a better understanding of how ASR systems behave in EP under different background noise scenarios, considering that EP is one of the languages that is still not among the training priorities of the main ASR players.
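The abstract does not state which metric was used to quantify ASR performance. As an illustration only, the sketch below assumes word error rate (WER), a commonly used measure, and shows how transcriptions could be scored and averaged per system and test condition. The function names and the `(system, condition, reference, hypothesis)` record layout are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: simple lowercasing and whitespace tokenization; a real
    evaluation would likely apply more careful text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer_by_condition(results):
    """Average WER per (system, condition) pair.

    `results` is an iterable of (system, condition, reference, hypothesis)
    tuples; this layout is a hypothetical choice for the sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for system, condition, reference, hypothesis in results:
        entry = totals[(system, condition)]
        entry[0] += word_error_rate(reference, hypothesis)
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Grouping by a condition label (for example a noise scenario, microphone distance, age group, or gender) makes it possible to compare the same system across the factors the paper investigates. Averaging per-utterance WER is one design choice; pooling errors over all reference words is an equally common alternative.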
