Automatic Speaker Localization and Tracking: Using a Fusion of the Filtered Correlation with the Energy Differential

This paper presents a speaker localization system intended for speaker tracking by camera. The authors use the signals from two microphones, placed in opposition, to determine the position of the active speaker, with the aim of supervising the audio-visual recording. To perform the localization task, the authors propose and employ two methods: the filtered correlation method and the energy differential method. The first is based on computing the correlation between the two signals collected by the two microphones, combined with a special filtering. The second is based on computing the logarithmic energy differential between these two signals. When several methods are used simultaneously to make a decision, it is often useful to apply a fusion technique that combines their estimates or decisions in order to improve system performance. For that purpose, this paper proposes two fusion techniques, operating at the decision level, that fuse the two estimates into a single one that should be more precise.

DOI: 10.4018/jmcmc.2010070102. International Journal of Mobile Computing and Multimedia Communications, 2(3), 15-33, July-September 2010. Copyright © 2010, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

... speaker tracking according to the information given by all the sensors. Tracking technology is required both to keep the camera focused on the speaker and to display audience members when they talk. There are four general classes of tracking technology: sensor-based, motion-based, microphone-array-based and speaker-recognition-based. While all four methods can be used for a single speaker, only the third and the last are normally used for a multi-speaker audience (Liu, Rui, Gupta, & Cadiz, 2000).
In the context of automatic analysis of meetings, robust localization and tracking of active speakers is of fundamental importance, particularly for the enhancement and recognition of speech in microphone-array-based ASR (Automatic Speech Recognition) systems. Microphone arrays provide hands-free, high-quality distant speech acquisition through beamforming techniques, which rely on the speaker location for speech enhancement (Cox et al., 1987). Furthermore, localizing and tracking active speakers from multiple far-field microphones are challenging tasks in smart room scenarios, where the speech signal is corrupted by noise from presentation devices and by room reverberation (Maganti & Perez, 2006).

Sound source localization is defined as the determination of the coordinates of sound sources in relation to a point in space. It is achieved by using differences in the sound received at different microphones to estimate the direction and, if possible, the actual location of the sound source. For example, human ears act as two different sound observation points, enabling humans to estimate the direction of the sound source (Ui-Hyun, Jinsung, Doik, Hyogon, & Bum-Jae, 2008). So how can these ears estimate the speaker position? To answer this question, or at least to simulate this faculty with two opposite cardioid microphones, we conducted a thorough experimental investigation of two newly proposed techniques based on the filtered correlation and the energy differential, which led to several interesting results. However, since we implemented two different speaker localization methods whose detection decisions do not necessarily agree, we also proposed and implemented two fusion techniques in order to improve the precision of speaker localization and tracking.
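The two cues introduced above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the "special filtering" of the filtered correlation method is not described in this excerpt and is omitted here, and the sign conventions (a positive lag or a positive differential is taken to mean that the speaker is nearer the left microphone), the thresholds `tol` and `tol_db`, and all function names are assumptions made for illustration. The agreement rule in `fuse` is only a placeholder and is not one of the paper's two fusion techniques.

```python
import math


def log_energy_differential(left, right, eps=1e-12):
    """Return 10*log10(E_left / E_right) in dB for one frame of samples."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    # eps avoids log(0) and division by zero on silent frames.
    return 10.0 * math.log10((e_left + eps) / (e_right + eps))


def best_lag(left, right, max_lag):
    """Return the lag d (in samples) maximising sum_n left[n] * right[n + d].

    A positive d means the right channel looks like a delayed copy of the
    left channel. No filtering is applied (plain cross-correlation).
    """
    n = len(left)
    best, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                val += left[i] * right[j]
        if val > best_val:
            best, best_val = lag, val
    return best


def side_from_lag(lag, tol=0):
    """Assumed convention: sound reaching the left microphone first -> 'left'."""
    if lag > tol:
        return "left"
    if lag < -tol:
        return "right"
    return "middle"


def side_from_energy(diff_db, tol_db=3.0):
    """Assumed convention: more energy in the left channel -> 'left'."""
    if diff_db > tol_db:
        return "left"
    if diff_db < -tol_db:
        return "right"
    return "middle"


def fuse(decision_a, decision_b):
    """Placeholder decision-level rule: accept only when both cues agree."""
    return decision_a if decision_a == decision_b else None


if __name__ == "__main__":
    # Synthetic frame: a pulse that reaches the right channel 3 samples
    # later and at half the amplitude.
    left = [0.0] * 32
    right = [0.0] * 32
    left[10] = 1.0
    right[13] = 0.5
    lag = best_lag(left, right, max_lag=8)
    diff_db = log_energy_differential(left, right)
    print(lag, round(diff_db, 2), fuse(side_from_lag(lag), side_from_energy(diff_db)))
```

In a real recording the frame length, the lag search range and both thresholds would have to be tuned to the microphone spacing and sampling rate; the values used here are arbitrary.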
Speech Database

We have built four experimental databases with different scenarios, different speakers and different configurations:

• DB8 database: the distance between the two microphones is 4.20 m.
• DB9 database: the distance between the two microphones is 2 m.
• DB10 database: the distance between the two microphones is 1 m.
• DB11 database: the distance between the two microphones is 1 m.

In this paper, we describe only the experiments done on the DB11 database, since the results obtained with long distances (DB8 and DB9) are strongly affected by the echo effect, and those obtained on DB10 are insufficient. The DB11 database contains several scenarios with different speakers speaking alternately in a natural manner and with different configurations. There are two general configurations: a stable configuration and a mobile configuration. In the stable configuration, the speakers are seated at one of the 3 fixed positions, Left, Middle or Right (Figure 1.a and Figure 1.b), along the same line. In the mobile configuration, the speaker walks smoothly from one side to the other (e.g., from the left to the right). The distance between the two microphones is 1 m, the number of scenarios is 11 and the number of speakers is 7 (4 female and 3 male speakers). The signals collected by the 2 cardioid microphones are sampled at a frequency of about 44 kHz with 16-bit resolution, in a stereophonic acquisition.

Sound Field Description

Various techniques exist that may be used to passively locate an acoustic source in a sound field (Lathoud, 2006).
Each of the techniques ...

[1] Hyogon Kim, et al. Speaker localization using the TDOA-based feature matrix for a humanoid robot, 2008, RO-MAN 2008 - The 17th IEEE International Symposium on Robot and Human Interactive Communication.

[2] In Lee. Mobile Services Industries, Technologies, and Applications in the Global Economy, 2012.

[3] Gongjun Yan, et al. A probabilistic routing protocol in VANET, 2009, MoMM.

[4] Anoop Gupta, et al. Automating camera management for lecture room environments, 2001, CHI.

[5] Daniel Gatica-Perez, et al. Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech, 2006, ICMI '06.

[6] David Taniar, et al. Mobile Computing: Concepts, Methodologies, Tools, and Applications, 2008.

[7] Gerald C. Lauchle. Effect of Turbulent Boundary Layer Flow on Measurement of Acoustic Pressure and Intensity, 1984.

[8] Guillaume Lathoud, et al. Spatio-Temporal Analysis of Spontaneous Speech with Microphone Arrays, 2006.

[9] Arun Ross, et al. An introduction to biometric recognition, 2004, IEEE Transactions on Circuits and Systems for Video Technology.

[10] Nir Kshetri, et al. China: M-Commerce in World's Largest Mobile Market, 2006.

[11] Enzo Mumolo, et al. Algorithms for acoustic localization based on microphone array in service robotics, 2003, Robotics Auton. Syst.

[12] David Taniar, et al. International Journal of Mobile Computing and Multimedia Communications, 2010.

[13] David Taniar. Encyclopedia of Mobile Computing and Commerce, 2007.

[14] Javier Ruiz Hidalgo, et al. Integration of audiovisual sensors and technologies in a smart room, 2007, Personal and Ubiquitous Computing.

[15] Felix Schaeffler, et al. A methodological study into the linguistic dimensions of pitch range differences between German and English, 2008, Speech Prosody 2008.

[16] Halim Sayoud, et al. Speaker Discrimination on Broadcast News and Telephonic Calls Using a Fusion of Neural and Statistical Classifiers, 2009, Int. J. Mob. Comput. Multim. Commun.