Multimodal identification and localization of users in a smart environment

Detecting the location and identity of users is a first step in creating context-aware applications for technologically endowed environments. We propose a system that makes use of motion detection, person tracking, face identification, feature-based identification, audio-based localization, and audio-based identification modules, fusing information with particle filters to obtain robust localization and identification. The data streams are processed with the help of the generic client-server middleware SmartFlow, resulting in a flexible architecture that runs across different platforms.
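
To illustrate what fusing modalities with a particle filter can look like, the following is a minimal sketch of a bootstrap particle filter that combines a video-based and an audio-based position estimate for a single person. It is not the system described above: the function names, the random-walk motion model, the Gaussian likelihoods, and all noise parameters are assumptions chosen for illustration, and the two modalities are assumed to be conditionally independent given the true position.

```python
import math
import random


def gaussian_likelihood(observed, predicted, sigma):
    """Unnormalised Gaussian likelihood of a 2-D position observation."""
    d2 = (observed[0] - predicted[0]) ** 2 + (observed[1] - predicted[1]) ** 2
    return math.exp(-0.5 * d2 / sigma ** 2)


def particle_filter_step(particles, video_obs, audio_obs, rng,
                         motion_sigma=0.2, video_sigma=0.3, audio_sigma=0.8):
    """One predict-update-resample cycle (all sigmas are illustrative).

    particles: list of (x, y) position hypotheses in metres.
    video_obs, audio_obs: (x, y) estimate from each modality, or None
    when that modality produced no observation in this frame.
    """
    # Predict: diffuse each particle with a random-walk motion model.
    predicted = [(x + rng.gauss(0.0, motion_sigma),
                  y + rng.gauss(0.0, motion_sigma)) for x, y in particles]

    # Update: multiply the per-modality likelihoods (assumes the
    # modalities are conditionally independent given the position).
    weights = []
    for p in predicted:
        w = 1.0
        if video_obs is not None:
            w *= gaussian_likelihood(video_obs, p, video_sigma)
        if audio_obs is not None:
            w *= gaussian_likelihood(audio_obs, p, audio_sigma)
        weights.append(w)

    if sum(weights) == 0.0:
        # Every weight underflowed; keep the predicted set unweighted.
        return predicted

    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Mean position of the particle set."""
    n = len(particles)
    return (sum(p[0] for p in particles) / n,
            sum(p[1] for p in particles) / n)


def run_demo(seed=0, steps=30, n_particles=500):
    """Track a stationary person at (2, 3) in a hypothetical 5 m x 5 m room."""
    rng = random.Random(seed)
    true_pos = (2.0, 3.0)
    particles = [(rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0))
                 for _ in range(n_particles)]
    for t in range(steps):
        # Video gives a fairly precise estimate in every frame.
        video = (true_pos[0] + rng.gauss(0.0, 0.3),
                 true_pos[1] + rng.gauss(0.0, 0.3))
        # Audio is noisier and only present when the person speaks,
        # simulated here as every third frame.
        audio = None
        if t % 3 == 0:
            audio = (true_pos[0] + rng.gauss(0.0, 0.8),
                     true_pos[1] + rng.gauss(0.0, 0.8))
        particles = particle_filter_step(particles, video, audio, rng)
    return estimate(particles)


if __name__ == "__main__":
    print(run_demo())
```

Handling a missing modality by simply skipping its likelihood term is one simple way to let the filter keep running when a person is silent or occluded. A real deployment would also need data association across multiple people and identity cues, which this sketch leaves out.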
