Abstract

In this paper we describe a classifier combination technique used in a human identification system. The system integrates all available features from multi-modal sources within a Bayesian framework. The framework allows a class of popular classifier combination rules and methods to be represented within a single formalism. It relies on a “per-class” measure of confidence derived from the performance of each classifier on training data, which is shown to improve performance on a synthetic data set. The method is especially relevant in an autonomous surveillance setting, where varying time scales and missing features are a common occurrence. We show an application of this technique to a real-world surveillance database of video and audio recordings of people collected over several weeks in an office setting.

1 Introduction and Motivation

In problems of biometric verification and identification, the multi-modal aspect of the observation plays a large role. A person can be identified by a number of features, including face, height, body shape, gait, voice, etc. However, the features are not equal in their overall contribution to identifying a person. For instance, modern algorithms for face classification (e.g. [11]) and speaker identification (e.g. [6]) can attain high recognition rates, provided that the data is well formed and relatively free of variations and noise, while other features, such as gait (e.g. [1]) or body shape, are only mildly discriminative.

Even though one can achieve high recognition rates when classifying some of these features, in reality they are observed only relatively rarely: in a surveillance video sequence, the face image can be used only if the person is close enough and is facing the camera, and a person’s voice only when the person is speaking. In contrast, there is a plentiful supply of the less discriminative features. This situation is illustrated by an example from one of our video sequences in figure 1.
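As an illustration of the “per-class” confidence idea summarized in the abstract, the following is a minimal sketch and not the formulation used in the paper, which this section does not spell out. It assumes that each classifier's confidence is estimated from a confusion matrix measured on training data, that a classifier's output is corrected by the estimated probability of the true class given the predicted class, and that the available modalities are weighted equally. The function names, the equal weighting, and the toy numbers are all hypothetical. A modality that is not observed (for example, no face in view) is simply skipped.

```python
from typing import Dict, List, Optional, Sequence


def per_class_confidence(confusion: Sequence[Sequence[int]]) -> List[List[float]]:
    """Turn training counts confusion[true][pred] into estimates of P(true | pred).

    Assumption: confidence is read off a training confusion matrix by
    normalizing each predicted-class column.
    """
    n = len(confusion)
    conf = [[0.0] * n for _ in range(n)]
    for pred in range(n):
        col_total = sum(confusion[true][pred] for true in range(n))
        for true in range(n):
            if col_total:
                conf[true][pred] = confusion[true][pred] / col_total
            else:
                # Class never predicted in training: fall back to a uniform guess.
                conf[true][pred] = 1.0 / n
    return conf


def combine(
    posteriors: Dict[str, Optional[Sequence[float]]],
    confidences: Dict[str, List[List[float]]],
) -> List[float]:
    """Average confidence-corrected posteriors over the modalities that are present.

    posteriors[m] is the classifier output for modality m, or None if the
    feature is missing in the current observation.
    """
    available = [m for m, p in posteriors.items() if p is not None]
    if not available:
        raise ValueError("no modality observed")
    n = len(posteriors[available[0]])
    combined = [0.0] * n
    for m in available:
        post, conf = posteriors[m], confidences[m]
        for true in range(n):
            corrected = sum(conf[true][k] * post[k] for k in range(n))
            combined[true] += corrected / len(available)  # equal weights (assumption)
    return combined


# Toy two-person example: a reliable face classifier and a weak gait classifier.
confidences = {
    "face": per_class_confidence([[9, 1], [1, 9]]),
    "gait": per_class_confidence([[6, 4], [4, 6]]),
}
both = combine({"face": [1.0, 0.0], "gait": [0.0, 1.0]}, confidences)
gait_only = combine({"face": None, "gait": [0.0, 1.0]}, confidences)
print(both, gait_only)
```

In this toy setup the two classifiers disagree, and the combined result leans toward the face classifier because its training confusion matrix is sharper. When the face is missing, the estimate falls back to the gait classifier's softened output.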
[1] Larry S. Davis, et al. Stride and cadence as a biometric in automatic person identification and verification. Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition, 2002.
[2] Jeho Nam, et al. Speaker identification and video analysis for hierarchical video shot classification. Proceedings of International Conference on Image Processing, 1997.
[3] Jiri Matas, et al. Combining evidence in personal identity verification systems. Pattern Recognit. Lett., 1997.
[4] Arun Ross, et al. Information fusion in biometrics. Pattern Recognit. Lett., 2003.
[5] Azriel Rosenfeld, et al. Face recognition: A literature survey. CSUR, 2003.
[6] Josef Kittler, et al. Combining multiple classifiers by averaging or by multiplying? Pattern Recognit., 2000.
[7] Jiri Matas, et al. Combining Evidence in Multimodal Personal Identity Recognition Systems. AVBPA, 1997.
[8] Thomas Serre, et al. Categorization by Learning and Combining Object Parts. NIPS, 2001.
[9] Robert P. W. Duin, et al. A Discussion on the Classifier Projection Space for Classifier Combining. Multiple Classifier Systems, 2002.
[10] Jeff A. Bilmes, et al. Directed graphical models of classifier combination: application to phone recognition. INTERSPEECH, 2000.