Paper information - Efficient video coding in H.264/AVC by using audio-visual information

Efficient video coding in H.264/AVC by using audio-visual information

This paper proposes an efficient video coding method that exploits audio-visual information, based on the observation that sound-emitting regions in a video sequence attract observers' attention. An audio-visual source localization algorithm identifies the regions responsible for the sound. The result is then used to encode different regions of the scene at different quality levels, so that a region far from the sound source is coded at lower quality than the sound-emitting regions. This is implemented by assigning different quantization parameter values to different regions in H.264/AVC. Experimental results demonstrate the effectiveness of the proposed approach.
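The region-dependent quantization described in the abstract can be illustrated with a minimal sketch. The abstract does not state how quantization parameter (QP) values depend on distance from the sound source, so the linear ramp below, along with the function name `qp_map` and the parameters `base_qp`, `max_offset`, and `radius`, are assumptions made purely for illustration. The sketch also assumes the sound source has already been localized to a single pixel position.

```python
import math

MB_SIZE = 16  # H.264/AVC macroblock size in pixels


def qp_map(width, height, source_xy, base_qp=26, max_offset=10, radius=None):
    """Return a per-macroblock QP grid (list of rows of ints).

    Hypothetical mapping: macroblocks at the sound source get base_qp, and
    QP rises linearly with distance until it reaches base_qp + max_offset
    at `radius` pixels and beyond. Higher QP means coarser quantization.
    """
    cols = math.ceil(width / MB_SIZE)
    rows = math.ceil(height / MB_SIZE)
    if radius is None:
        # Assumed default: half of the frame diagonal.
        radius = math.hypot(width, height) / 2
    sx, sy = source_xy
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Distance from the macroblock centre to the sound source.
            cx = c * MB_SIZE + MB_SIZE / 2
            cy = r * MB_SIZE + MB_SIZE / 2
            d = math.hypot(cx - sx, cy - sy)
            offset = round(max_offset * min(d / radius, 1.0))
            # Clamp to the 0..51 QP range of 8-bit H.264/AVC.
            row.append(min(51, max(0, base_qp + offset)))
        grid.append(row)
    return grid


if __name__ == "__main__":
    # 64x32 frame with the (assumed) sound source in the top-left macroblock.
    for row in qp_map(64, 32, source_xy=(8, 8)):
        print(row)
```

In a real encoder, a map like this would be supplied through whatever per-macroblock QP control the encoder exposes; that interface is encoder-specific and is not shown here.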

Touradj Ebrahimi | Jong-Seok Lee

[1] H. McGurk, et al. Hearing lips and seeing voices, 1976, Nature.

[2] Laurent Itti, et al. Automatic foveation for video compression using a neurobiological model of visual attention, 2004, IEEE Transactions on Image Processing.

[3] Andrea Cavallaro, et al. Target Detection and Tracking With Heterogeneous Sensors, 2008, IEEE Journal of Selected Topics in Signal Processing.

[4] J.N. Gowdy, et al. CUAVE: A new audio-visual database for multimodal human-computer interface research, 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5] Chih-Wei Tang, et al. Spatiotemporal Visual Considerations for Video Coding, 2007, IEEE Transactions on Multimedia.

[6] Cheol Hoon Park, et al. Robust Audio-Visual Speech Recognition Based on Late Integration, 2008, IEEE Transactions on Multimedia.

[7] J. Driver, et al. Audiovisual links in endogenous covert spatial attention, 1996, Journal of Experimental Psychology: Human Perception and Performance.

[8] Touradj Ebrahimi, et al. Semantic video analysis for adaptive content delivery and automatic description, 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[9] C. Spence, et al. Attention and the crossmodal construction of space, 1998, Trends in Cognitive Sciences.

[10] Steven A. Hillyard, et al. Neural Substrates of Perceptual Enhancement by Cross-Modal Spatial Attention, 2003, Journal of Cognitive Neuroscience.

[11] Jean-Philippe Thiran, et al. Extraction of Audio Features Specific to Speech Production for Multimodal Speaker Detection, 2008, IEEE Transactions on Multimedia.

[12] Michael Elad, et al. Cross-Modal Localization via Sparsity, 2007, IEEE Transactions on Signal Processing.

[13] J. Driver, et al. Audiovisual links in exogenous covert spatial orienting, 1997, Perception & Psychophysics.

[14] Christian Jutten, et al. Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[15] Paolo Napoletano, et al. Bayesian Integration of Face and Low-Level Cues for Foveated Video Coding, 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[16] B. Stein, et al. The Merging of the Senses, 1993.

[17] Vladimir Pavlovic, et al. Toward multimodal human-computer interface, 1998, Proceedings of the IEEE.

[18] Sugato Chakravarty, et al. Methodology for the subjective assessment of the quality of television pictures, 1995.

[19] Touradj Ebrahimi, et al. Video coding based on audio-visual attention, 2009, 2009 IEEE International Conference on Multimedia and Expo.

[20] A. Murat Tekalp, et al. Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis, 2007, IEEE Transactions on Multimedia.

[21] Patrick Pérez, et al. Data fusion for visual tracking with particles, 2004, Proceedings of the IEEE.