Bimodal log-linear regression for fusion of audio and visual features

One of the most commonly used audiovisual fusion approaches is feature-level fusion, where the audio and visual features are simply concatenated. Although this approach has been used successfully in several applications, it does not take into account interactions between the features, which can be a problem when one or both modalities contain noisy features. In this paper, we investigate whether feature fusion based on explicit modelling of interactions between audio and visual features can outperform a classifier that fuses the features by simple concatenation. To this end, we propose a log-linear model, named Bimodal Log-linear Regression, which accounts for interactions between the features of the two modalities. The performance of the target classifiers is measured on the task of laughter-vs-speech discrimination, since both laughter and speech are naturally audiovisual events. Our experiments on the MAHNOB laughter database suggest that feature fusion based on explicit modelling of interactions between the audio and visual features leads to an improvement of 3% over the standard feature-concatenation approach when a log-linear model is used as the base classifier. Finally, the most and least influential features can easily be identified by inspecting their interaction terms.
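As a rough illustration of the idea, the following is a minimal sketch (not the paper's exact formulation) of a log-linear classifier, here realised as logistic regression, whose feature map augments the concatenated audio and visual features with explicit cross-modal interaction terms a_i * v_j. The feature dimensions, the toy data, and the laughter/speech labels are purely illustrative assumptions; the paper's actual model, feature sets, and regularisation may differ.

```python
# Sketch: log-linear (logistic regression) classifier with explicit
# audio-visual interaction terms, i.e. all pairwise products a_i * v_j
# appended to the concatenated feature vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

def bimodal_features(audio, visual):
    """Concatenate audio and visual features and append all pairwise
    audio-visual products as cross-modal interaction terms."""
    interactions = np.einsum('ni,nj->nij', audio, visual).reshape(len(audio), -1)
    return np.hstack([audio, visual, interactions])

# Toy data (illustrative only): 200 clips, 6 audio features and 4 visual
# features; labels 1 = laughter, 0 = speech.
rng = np.random.default_rng(0)
audio = rng.normal(size=(200, 6))
visual = rng.normal(size=(200, 4))
labels = rng.integers(0, 2, size=200)

clf = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
clf.fit(bimodal_features(audio, visual), labels)

# The learned weights on the interaction block hint at which audio-visual
# feature pairs are most (or least) influential for the discrimination.
interaction_weights = clf.coef_[0, 6 + 4:].reshape(6, 4)
print(interaction_weights)
```

Inspecting the weights of the interaction block is one way to read off influential feature pairs, in the spirit of the analysis described in the abstract.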
