Current trends in joint audio-video signal processing: a review

Multimodal signal processing has gained considerable significance in recent years, driven by advances in computer technology and the availability of more sophisticated sensors. One example is the joint processing of audio and video signals, which is used in a variety of applications. This paper serves as a broad introduction to the special session on “Audio-Video Signal Processing and its Applications”. It reviews current trends and developments in joint audio-video (AV) signal processing and gives an overview of current theoretical and application issues in this area. We focus on speech processing, person authentication, and affective sensing as examples, and give an overview of available AV data corpora.
