Model-based Head Orientation Estimation for Smart Devices

Voice interaction is friendly and convenient for users. Smart devices such as Amazon Echo let users interact with them through voice commands and have become increasingly popular in daily life. In recent years, research has focused on using the microphone arrays built into smart devices to localize the user, which adds context information to voice commands. In contrast, few works explore the user's head orientation, which also carries useful context information. For example, when a user says "turn on the light", the head orientation could be used to infer which light the user is referring to. Existing model-based approaches require a large number of microphone arrays to form an array network, while machine learning-based approaches require laborious data collection and training. The high deployment and usage cost of these methods is unfriendly to users. In this paper, we propose HOE, a model-based system that enables Head Orientation Estimation for smart devices with only two microphone arrays and requires a lower training overhead than previous approaches. HOE first estimates candidates for the user's head orientation by measuring the voice energy radiation pattern. It then leverages the voice frequency radiation pattern to obtain the final result. Real-world experiments show that HOE achieves a median estimation error of 23 degrees. To the best of our knowledge, HOE is the first model-based attempt to estimate head orientation using only two microphone arrays without an arduous training overhead.
