Audio head pose estimation using the direct to reverberant speech ratio

Head pose is an important cue in many applications, such as speech recognition and face recognition. Most approaches to head pose estimation to date have used visual information to model and recognise a subject's head in different configurations. These approaches have a number of limitations, such as an inability to cope with occlusions, changes in the appearance of the head, and low-resolution images. We present here a novel method for determining coarse head pose orientation purely from audio information, exploiting the direct-to-reverberant speech energy ratio (DRR) within a highly reverberant meeting room environment. Our hypothesis is that a speaker facing towards a microphone will exhibit a higher DRR, while a speaker facing away from the microphone will exhibit a lower DRR. This hypothesis is confirmed by experiments conducted on the publicly available AV16.3 database.
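To illustrate the quantity the abstract is built around, here is a minimal sketch of how a DRR could be computed from a room impulse response: energy in a short window around the direct-path peak versus energy in the reverberant tail. This is a generic textbook formulation, not the paper's own estimator (which works blindly from speech rather than from a measured impulse response); the function name, the 2.5 ms window, and the synthetic impulse response are illustrative assumptions.

```python
import numpy as np

def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio (dB) from a room impulse response.

    Direct-path energy is taken from a short window around the main
    peak of the RIR; everything after that window counts as reverberant.
    (A common textbook definition, not the estimator used in the paper.)
    """
    peak = int(np.argmax(np.abs(rir)))
    half = int(fs * direct_window_ms / 1000)
    start = max(0, peak - half)
    end = min(len(rir), peak + half)
    direct_energy = np.sum(rir[start:end] ** 2)
    reverb_energy = np.sum(rir[end:] ** 2)
    return 10.0 * np.log10(direct_energy / reverb_energy)

# Synthetic example: a unit direct path followed by a decaying
# diffuse tail, roughly mimicking a reverberant meeting room.
fs = 16000
rng = np.random.default_rng(0)
rir = np.zeros(fs // 2)
rir[100] = 1.0  # direct path arrival
t = np.arange(200, len(rir))
rir[200:] = 0.05 * rng.standard_normal(len(t)) * np.exp(-(t - 200) / 2000)
print(f"DRR: {drr_db(rir, fs):.1f} dB")
```

Under the paper's hypothesis, a talker facing the microphone boosts the direct-path term in this ratio, so the DRR rises; a talker facing away radiates less energy toward the microphone directly while the diffuse tail changes little, so the DRR falls.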
