Paper Information - FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks

FaceSync is an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices.
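The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the audio and image data are already arranged as per-frame feature matrices with matching row counts and full column rank, and it uses a standard QR-plus-SVD route to canonical correlation as one way to avoid forming the correlation matrices explicitly. The function names `first_canonical_pair` and `sync_score` are hypothetical.

```python
import numpy as np


def first_canonical_pair(audio, video):
    """Find the first pair of canonical directions for two feature matrices.

    audio: (n_frames, n_audio_features), video: (n_frames, n_video_features).
    Assumes both matrices have full column rank after mean removal.
    Returns (wa, wv, rho), where rho is the first canonical correlation.
    """
    a = audio - audio.mean(axis=0)
    v = video - video.mean(axis=0)

    # Orthonormal bases for each data set; no covariance matrix is formed.
    qa, ra = np.linalg.qr(a)
    qv, rv = np.linalg.qr(v)

    # Singular values of qa^T qv are the canonical correlations.
    u, s, vt = np.linalg.svd(qa.T @ qv)

    # Map the leading singular vectors back to the original feature spaces.
    wa = np.linalg.solve(ra, u[:, 0])
    wv = np.linalg.solve(rv, vt[0, :])
    return wa, wv, s[0]


def sync_score(audio, video, wa, wv):
    """Project both streams onto a single axis and return Pearson's r."""
    pa = audio @ wa
    pv = video @ wv
    return np.corrcoef(pa, pv)[0, 1]
```

In this sketch the weights would be estimated once on synchronized training data; scoring a new clip then only needs two projections and one correlation, and a score near zero suggests the audio and image streams are not aligned.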

Malcolm Slaney | Michele Covell
