Conjugate Mixture Models for the Modeling of Visual and Auditory Perception. (Modèles de Mélanges Conjugués pour la Modélisation de la Perception Visuelle et Auditive)

This thesis considers the modeling of audio-visual perception with a head-like device. It addresses the related problems of audio-visual calibration and of audio-visual object detection, localization, and tracking. A spatio-temporal approach to calibrating the head-like device is proposed, based on probabilistic multimodal trajectory matching. The formalism of conjugate mixture models is introduced, along with a family of efficient optimization algorithms that perform multimodal clustering. One instance of this family, the conjugate expectation maximization (ConjEM) algorithm, is further improved to gain attractive theoretical properties. Methods for multimodal object detection and for estimating the number of objects are developed, and their theoretical properties are discussed. Finally, the proposed multimodal clustering method is combined with the object detection and object number estimation strategies and with known tracking techniques to perform multimodal multi-object tracking. Performance is demonstrated on simulated data and on the CAVA database of realistic audio-visual scenarios.
