Efficient video coding in H.264/AVC by using audio-visual information

This paper proposes an efficient video coding method that exploits audio-visual information, based on the observation that sound-emitting regions in a video sequence attract the observer's attention. The regions responsible for the sound are identified by an audio-visual source localization algorithm, and the result is used to encode different regions of the scene at different quality levels, so that regions far from the sound source are coded at lower quality than the sound-emitting regions. This is implemented by assigning different quantization parameter values to different regions in H.264/AVC. Experimental results demonstrate the effectiveness of the proposed approach.
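To illustrate the idea of region-dependent quantization, the sketch below builds a per-macroblock QP map in which blocks near a localized sound source keep a base QP and blocks farther away receive a higher QP (coarser quantization). This is a minimal sketch under stated assumptions: the function name, the linear distance-to-QP mapping, and the parameter values are illustrative and not the paper's exact scheme, which derives regions from an audio-visual localization algorithm.

```python
import numpy as np

def qp_map(frame_h, frame_w, source_xy, base_qp=26, max_offset=12, mb_size=16):
    """Assign a QP to each 16x16 macroblock of an H.264/AVC frame.

    Macroblocks near the localized sound source keep base_qp; QP rises
    (quality drops) linearly with distance, up to base_qp + max_offset.
    The linear mapping and all parameter values are illustrative
    assumptions, not the paper's actual region/QP selection.
    """
    mbs_y, mbs_x = frame_h // mb_size, frame_w // mb_size
    ys, xs = np.mgrid[0:mbs_y, 0:mbs_x]
    # Macroblock centers in pixel coordinates
    cy = ys * mb_size + mb_size / 2
    cx = xs * mb_size + mb_size / 2
    dist = np.hypot(cy - source_xy[1], cx - source_xy[0])
    dist_norm = dist / dist.max()              # 0 at the source, 1 farthest away
    qp = base_qp + np.round(dist_norm * max_offset).astype(int)
    return np.clip(qp, 0, 51)                  # valid H.264/AVC QP range is 0..51

# Example: a CIF (352x288) frame with the speaker localized at pixel (100, 150)
qps = qp_map(288, 352, source_xy=(100, 150))
print(qps.shape, qps.min(), qps.max())         # 18x22 macroblock grid
```

The resulting map could then drive an encoder's per-macroblock QP setting, trading bits away from regions the viewer is unlikely to attend to.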
