Detection of Audio-Video Synchronization Errors Via Event Detection

We present a new method and a large-scale database for detecting audio-video synchronization (A/V sync) errors in tennis videos. One deep network is trained to detect the visual signature of the tennis ball being hit by the racquet in the video stream, and another is trained to detect the auditory signature of the same event in the audio stream. At evaluation time, the audio network searches the audio stream for the sound of the ball being hit. When such an audio event is found, the neighboring interval of the video stream is searched for the corresponding visual signature; if no matching visual event is found, an A/V sync error is flagged. We built a database of 504,300 frames from 6 hours of video of tennis events, simulated A/V sync errors, and show that our method achieves high accuracy on this task.
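
The flagging rule described above can be summarized in a few lines. The sketch below is illustrative only: it assumes both networks emit per-frame event probabilities aligned to a common timeline, and the function name, thresholds, and window size are hypothetical placeholders rather than values from the paper.

```python
def flag_av_sync_errors(audio_scores, video_scores,
                        audio_thresh=0.5, video_thresh=0.5,
                        window=5):
    """Flag frames where a ball hit is heard but not seen nearby.

    audio_scores / video_scores: per-frame event probabilities from
    the two detectors, aligned to the same frame index.  The
    thresholds and the +/- `window` search interval are illustrative
    defaults, not the paper's settings.
    """
    errors = []
    for t, a in enumerate(audio_scores):
        if a < audio_thresh:
            continue  # no ball-hit event detected in audio at frame t
        lo = max(0, t - window)
        hi = min(len(video_scores), t + window + 1)
        # Event heard but no visual signature in the neighboring
        # interval -> flag an A/V sync error at this frame.
        if max(video_scores[lo:hi], default=0.0) < video_thresh:
            errors.append(t)
    return errors


# Example: a hit is heard at frame 0 but never seen, so frame 0 is flagged.
print(flag_av_sync_errors([0.9, 0.1, 0.1], [0.1, 0.1, 0.1]))  # -> [0]
```

Searching a window around the audio event, rather than only the exact frame, tolerates small detector timing jitter while still catching genuine desynchronization.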
