Feature-level data fusion for bimodal person recognition

Consistently high person recognition accuracy is difficult to attain using a single recognition modality. This paper assesses the fusion of voice and outer lip-margin features for person identification. Feature fusion is investigated in the form of audio-visual feature vector concatenation, principal component analysis (PCA), and linear discriminant analysis (LDA). The paper shows that, under mismatched test and training conditions, audio-visual feature fusion is equivalent to an effective increase in the signal-to-noise ratio of the audio signal. Audio-visual feature vector concatenation is shown to be an effective method for feature combination, and LDA is shown to pack discriminating audio-visual information into fewer coefficients than PCA. The paper also reveals a high sensitivity of bimodal person identification to a mismatch between the noise conditions under which the LDA or PCA feature-fusion module and the speaker models are trained: such a mismatch leads to identification accuracy below that of unimodal identification.
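
To make the feature-level fusion pipeline concrete, the following is a minimal sketch of concatenating synchronous audio and visual feature vectors and then applying PCA and LDA over the fused vectors. The synthetic data, feature dimensions, and use of scikit-learn are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of feature-level audio-visual fusion (illustrative only).
# Feature dimensions, synthetic data, and scikit-learn are assumptions,
# not the paper's implementation.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

n_speakers, n_frames = 10, 200
audio_dim, visual_dim = 12, 6  # e.g. MFCC-like and lip-margin parameters

# Synthetic per-frame features with small speaker-dependent offsets.
labels = np.repeat(np.arange(n_speakers), n_frames)
audio = rng.normal(size=(labels.size, audio_dim)) + labels[:, None] * 0.10
visual = rng.normal(size=(labels.size, visual_dim)) + labels[:, None] * 0.05

# Feature-level fusion: concatenate the synchronous audio and visual vectors.
fused = np.hstack([audio, visual])

# Dimensionality reduction over the fused vectors. LDA uses speaker labels,
# so it concentrates class-discriminative information into fewer
# coefficients, whereas PCA ranks directions by variance alone.
pca = PCA(n_components=8).fit(fused)
lda = LinearDiscriminantAnalysis(n_components=8).fit(fused, labels)

print(pca.transform(fused).shape, lda.transform(fused).shape)
```

The reduced PCA or LDA vectors would then be passed to the speaker models; the mismatch result above concerns training this reduction stage and the speaker models under different noise conditions.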