Extraction of Audio Features Specific to Speech Production for Multimodal Speaker Detection

A method that exploits an information-theoretic framework to extract optimized audio features using video information is presented. A simple measure of mutual information (MI) between the resulting audio and video features allows the active speaker to be detected among several candidates. The method involves the optimization of an MI-based objective function. No approximation is needed to solve this optimization problem, either for the estimation of the probability density functions (pdfs) of the features or for the cost function itself. The pdfs are estimated from the samples using a nonparametric approach. The challenging optimization problem is solved using a global method, the differential evolution algorithm. Two information-theoretic optimization criteria are compared, and their ability to extract audio features specific to speech production is discussed. Using these specific audio features, candidate video features are then classified as members of the "speaker" or "non-speaker" class, resulting in a speaker detection scheme. Our method achieves a speaker detection rate of 100% on in-house test sequences and of 85% on the most commonly used sequences.
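The MI-based detection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one-dimensional audio and video features, a Gaussian Parzen window with a rule-of-thumb bandwidth, a resubstitution estimate of MI, and synthetic data, none of which are specified in the abstract. The function names (`parzen_mi`, `detect_speaker`) are hypothetical, and the optimization of the audio features by differential evolution is deliberately left out.

```python
import math
import random


def _bandwidth(xs):
    """Rule-of-thumb (Silverman) bandwidth; an assumption, not from the paper."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * std * n ** (-0.2)


def _gauss(u, h):
    """One-dimensional Gaussian kernel of width h evaluated at u."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def parzen_mi(a, v):
    """Estimate I(A;V) in nats from paired samples with Parzen-window pdfs.

    The marginal and joint densities are evaluated at each sample, and the
    log-ratio log p(a,v) / (p(a) p(v)) is averaged over the samples.
    """
    n = len(a)
    ha, hv = _bandwidth(a), _bandwidth(v)
    total = 0.0
    for i in range(n):
        ka = [_gauss(a[i] - a[j], ha) for j in range(n)]
        kv = [_gauss(v[i] - v[j], hv) for j in range(n)]
        p_a = sum(ka) / n
        p_v = sum(kv) / n
        # Product kernel for the joint density of (audio, video).
        p_av = sum(x * y for x, y in zip(ka, kv)) / n
        total += math.log(p_av / (p_a * p_v))
    return total / n


def detect_speaker(audio, candidates):
    """Return the index of the candidate whose video feature shares the most
    information with the audio feature, together with all MI scores."""
    scores = [parzen_mi(audio, v) for v in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Synthetic demonstration data (purely illustrative).
rng = random.Random(0)
audio = [rng.gauss(0.0, 1.0) for _ in range(200)]
speaker_video = [x + rng.gauss(0.0, 0.3) for x in audio]   # depends on audio
silent_video = [rng.gauss(0.0, 1.0) for _ in range(200)]   # independent

speaker_index, scores = detect_speaker(audio, [speaker_video, silent_video])
print("detected speaker:", speaker_index)
print("MI scores (nats):", scores)
```

In this toy setting the candidate whose video feature is statistically dependent on the audio feature receives the larger MI score. In the method summarized above, the audio feature itself would first be optimized against an MI-based criterion before this comparison is made.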
