Tracking Humans using Multi-modal Fusion

Human motion detection plays an important role in automated surveillance systems. However, it is challenging to robustly detect non-rigid moving objects (e.g., humans) in a cluttered environment. In this paper, we compare two approaches for detecting walking humans using multi-modal measurements: video and audio sequences. The first approach is based on the Time-Delay Neural Network (TDNN), which fuses the audio and visual data at the feature level to detect the walking human. The second approach employs a Bayesian Network (BN) to jointly model the video and audio signals. Parameter estimation of the graphical model is performed using the Expectation-Maximization (EM) algorithm, and the location of the target is then tracked via Bayesian inference. Experiments are performed in several indoor and outdoor scenarios: in the lab, with more than one person walking, and with occlusion by bushes. A comparison of the performance and efficiency of the two approaches is also presented.
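The Bayesian inference step described above can be illustrated with a minimal sketch. Assuming the audio and video sensors yield conditionally independent Gaussian position likelihoods (an assumption for illustration; the paper's actual graphical model, features, and parameters are not reproduced here), the posterior over target location on a discretized 1-D grid is proportional to the product of the two likelihoods under a uniform prior. All function names and numeric values below are hypothetical.

```python
from math import exp, sqrt, pi

def gaussian(x, mu, sigma):
    """Gaussian density, used here as a per-sensor position likelihood."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def fuse(grid, audio_obs, video_obs, sigma_a=2.0, sigma_v=0.5):
    """Fuse audio and video observations via Bayes' rule on a 1-D grid.

    Posterior ∝ likelihood_audio * likelihood_video (uniform prior);
    the result is normalized to sum to 1 over the grid.
    """
    post = [gaussian(x, audio_obs, sigma_a) * gaussian(x, video_obs, sigma_v)
            for x in grid]
    z = sum(post)
    return [p / z for p in post]

# Illustrative run: audio localization is coarse (sigma 2.0 m),
# video localization is sharp (sigma 0.5 m), so the fused estimate
# is pulled strongly toward the video measurement.
grid = [i * 0.1 for i in range(100)]          # candidate positions, 0..9.9 m
posterior = fuse(grid, audio_obs=5.2, video_obs=5.0)
estimate = grid[max(range(len(grid)), key=lambda i: posterior[i])]
```

Because the video likelihood is much narrower than the audio likelihood, the fused maximum-a-posteriori estimate lands near the video observation; in a tracker, this posterior would be propagated over time rather than recomputed from a uniform prior at every frame.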
