Error weighted classifier combination for multi-modal human identification

Abstract

In this paper we describe a technique of classifier combination used in a human identification system. The system integrates all available features from multi-modal sources within a Bayesian framework. The framework allows representing a class of popular classifier combination rules and methods within a single formalism. It relies on a "per-class" measure of confidence, derived from the performance of each classifier on training data, that is shown to improve performance on a synthetic data set. The method is especially relevant in autonomous surveillance settings, where varying time scales and missing features are a common occurrence. We show an application of this technique to a real-world surveillance database of video and audio recordings of people collected over several weeks in an office setting.

1 Introduction and Motivation

In problems of biometric verification and identification a large role is played by the multi-modal aspect of the observation. A person can be identified by a number of features, including face, height, body shape, gait, voice, etc. However, the features are not equal in their overall contribution to identifying a person. For instance, modern algorithms for face classification (e.g. [11]) and speaker identification (e.g. [6]) can attain high recognition rates, provided that the data is well formed and relatively free of variations and noise, while other features, such as gait (e.g. [1]) or body shape, are only mildly discriminative.

Even though one can achieve high recognition rates when classifying some of these features, in reality they are observed only relatively rarely: in a surveillance video sequence the face image can be used only if the person is close enough and facing the camera, and a person's voice only when the person is speaking. In contrast, there is a plentiful supply of the less discriminative features. This situation is illustrated with an example from one of our video sequences in figure 1.
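The "per-class" confidence idea in the abstract can be illustrated with a minimal sketch: each classifier's confidence in a class is estimated from its per-class accuracy on held-out training data, and the classifiers' posteriors are then fused with those confidences as weights. This is only an illustrative approximation, not the paper's exact Bayesian formulation; the class names, toy labels, and the simple weighted-sum fusion rule are all assumptions for illustration.

```python
def per_class_accuracy(true_labels, predicted_labels, classes):
    """Fraction of samples of each class that a classifier labels correctly."""
    acc = {}
    for c in classes:
        idx = [i for i, t in enumerate(true_labels) if t == c]
        acc[c] = (sum(predicted_labels[i] == c for i in idx) / len(idx)
                  if idx else 0.0)
    return acc

def combine(posteriors, confidences, classes):
    """Fuse per-classifier posteriors, weighting each by its per-class confidence."""
    fused = {c: sum(conf[c] * post[c]
                    for post, conf in zip(posteriors, confidences))
             for c in classes}
    total = sum(fused.values()) or 1.0
    return {c: v / total for c, v in fused.items()}

classes = ["alice", "bob"]  # hypothetical identities

# Toy validation results for two modalities (face and gait classifiers).
face_conf = per_class_accuracy(["alice", "alice", "bob", "bob"],
                               ["alice", "alice", "bob", "alice"], classes)
gait_conf = per_class_accuracy(["alice", "alice", "bob", "bob"],
                               ["alice", "bob", "bob", "bob"], classes)

# Posteriors from each modality for a new observation.
face_post = {"alice": 0.6, "bob": 0.4}
gait_post = {"alice": 0.3, "bob": 0.7}

fused = combine([face_post, gait_post], [face_conf, gait_conf], classes)
print(max(fused, key=fused.get))  # → bob
```

Here the face classifier is perfectly reliable on "alice" but weaker on "bob", and vice versa for gait, so the fusion trusts each modality more where its training-time error was low.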