Audio head pose estimation using the direct to reverberant speech ratio

Head pose is an important cue in many applications, such as speech recognition and face recognition. Most approaches to head pose estimation to date have used visual information to model and recognise a subject's head in different configurations. These approaches have a number of limitations, such as an inability to cope with occlusions, changes in the appearance of the head, and low-resolution images. We present a novel method for determining coarse head orientation purely from audio information, exploiting the direct-to-reverberant speech energy ratio (DRR) in a highly reverberant meeting room environment. Our hypothesis is that a speaker facing towards a microphone will produce a higher DRR at that microphone, and a speaker facing away from it will produce a lower DRR. This hypothesis is confirmed by experiments conducted on the publicly available AV16.3 database.
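The hypothesis can be illustrated with a small sketch. It is not the estimator used in the paper, which works from recorded speech. It uses one common textbook definition of DRR, computed from a known room impulse response as the energy in a short window around the direct-path peak divided by the energy after that window. The function names, the 2.5 ms window, the synthetic impulse response and the gain values are all assumptions chosen for illustration. The sketch also assumes that turning away from the microphone attenuates only the direct path and leaves the reverberant tail unchanged.

```python
import math


def drr_db(rir, sample_rate, direct_window_ms=2.5):
    """DRR in dB from a room impulse response (illustrative definition)."""
    # Take the largest-magnitude sample as the direct-path arrival.
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    half = int(round(direct_window_ms * 1e-3 * sample_rate))
    start = max(0, peak - half)
    stop = min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[start:stop])
    reverberant = sum(x * x for x in rir[stop:])
    if reverberant == 0.0:
        return math.inf
    return 10.0 * math.log10(direct / reverberant)


def synthetic_rir(direct_gain, sample_rate=16000, rt60=0.8, length_s=0.5):
    """Toy impulse response: one direct impulse plus a decaying tail.

    Hypothetical model: the tail starts 5 ms after the direct path and
    its amplitude falls by 60 dB over rt60 seconds.
    """
    n = int(length_s * sample_rate)
    rir = [0.0] * n
    rir[0] = direct_gain
    tail_start = int(0.005 * sample_rate)
    for i in range(tail_start, n):
        t = (i - tail_start) / sample_rate
        sign = 1.0 if i % 2 == 0 else -1.0
        rir[i] = sign * 0.05 * 10.0 ** (-3.0 * t / rt60)
    return rir


# Assumed direct-path gains: 1.0 when facing the microphone, 0.3 when facing away.
facing = drr_db(synthetic_rir(1.0), 16000)
away = drr_db(synthetic_rir(0.3), 16000)
print(f"facing: {facing:.1f} dB, away: {away:.1f} dB")
```

In this toy model the reverberant energy is identical in both cases, so the DRR difference equals the assumed direct-path attenuation expressed in decibels. In a real room the difference would also depend on the talker's directivity pattern and on the room and microphone geometry.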
