The DIRHA Portuguese Corpus: A Comparison of Home Automation Command Detection and Recognition in Simulated and Real Data

In this paper, we describe a new corpus, named the DIRHA-L2F RealCorpus, composed of typical home automation speech interactions in European Portuguese. It was recorded by INESC-ID's Spoken Language Systems Laboratory (L2F) to support the activities of the EU-funded Distant-speech Interaction for Robust Home Applications (DIRHA) project. The corpus is a multi-microphone, multi-room database of real continuous audio sequences containing phonetically rich read sentences, read and spontaneous keyword activation sentences, and read and spontaneous home automation commands. The background noise conditions are controlled and randomly recreated using noises typically found in home environments. Experimental validation on this corpus is reported and compared with results obtained on a simulated corpus, using a fully automated speech processing pipeline for two fundamental automatic speech recognition tasks of typical ‘always-listening’ home automation scenarios: system activation and voice command recognition. The results on both corpora show that overlapping voice-like noise is the main problem: the simulated sequences contain concurrent speakers, which makes that corpus generally more challenging, while performance on the real sequences drops drastically when a TV or radio is on.
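
The ‘always-listening’ scenario evaluated in the paper follows a two-stage control flow: the system continuously monitors the recognized speech stream for a keyword activation sentence and, only once activated, attempts to recognize a home automation command. The sketch below is a minimal illustration of that control flow under stated assumptions, not the actual DIRHA pipeline; the activation phrases, command grammar, and function names are hypothetical placeholders, and a real system would feed the loop with hypotheses produced by a distant-speech ASR decoder.

```python
# Minimal sketch of a two-stage "always-listening" loop:
# stage 1 = keyword activation, stage 2 = voice command recognition.
# All phrases, grammar rules, and names below are hypothetical placeholders.

import re
from typing import Iterable, Iterator, Optional

ACTIVATION_PHRASES = {"atenção casa", "olá casa"}  # hypothetical keywords

# Hypothetical command grammar: action + device, with an optional room.
COMMAND_PATTERN = re.compile(
    r"^(liga|desliga|abre|fecha)\s+"
    r"(a luz|a televisão|a janela|o aquecedor)"
    r"(?:\s+(na sala|no quarto|na cozinha))?$"
)

def parse_command(hypothesis: str) -> Optional[dict]:
    """Match a recognized utterance against the command grammar."""
    m = COMMAND_PATTERN.match(hypothesis.strip().lower())
    if m is None:
        return None
    return {"action": m.group(1), "device": m.group(2), "room": m.group(3)}

def always_listening(hypotheses: Iterable[str]) -> Iterator[dict]:
    """Stage 1: wait for an activation phrase. Stage 2: decode one command."""
    activated = False
    for hypothesis in hypotheses:
        text = hypothesis.strip().lower()
        if not activated:
            activated = text in ACTIVATION_PHRASES
        else:
            command = parse_command(text)
            if command is not None:
                yield command
            activated = False  # return to passive listening after one attempt

# Example run with hand-written hypotheses standing in for ASR output.
if __name__ == "__main__":
    stream = ["está bom tempo", "olá casa", "liga a luz na sala"]
    for cmd in always_listening(stream):
        print(cmd)  # {'action': 'liga', 'device': 'a luz', 'room': 'na sala'}
```

In this sketch the activation state is reset after a single command attempt, which mirrors the activation-then-command interaction pattern described above; the grammar-based parser is only a stand-in for the command recognition stage.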
