MuViSync: Realtime music video alignment

In recent years, the popularity of compressed music files and online music downloads has increased dramatically. Today's users own large digital collections of high-quality music on their computers and portable devices, which they play at home or on the go. In addition, music videos are offered online both for free and through monthly subscriptions, opening up the opportunity to turn music listening into a multimedia experience. The work presented in this paper addresses the challenge of synchronising music audio with music video. In particular, we have developed a prototype, named MuViSync, that automatically synchronises music videos to the songs users are listening to in real time. At the core of the MuViSync prototype are novel audio synchronisation algorithms that tackle the differences in tempo, pitch, sampling rate, structure, and introductions and endings that are common across the various digital recordings of the same song. The music and the music video are initially aligned and then kept in sync within the limits of human perception. In experiments with 320 matching pairs of audio files and music videos, the proposed algorithms synchronise music and video to within 100 milliseconds of each other in over 90% of cases.
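The abstract does not spell out the alignment procedure, but the "align, then keep in sync" idea can be illustrated with a minimal sketch in the spirit of on-line time warping (Dixon, 2005), applied to per-frame audio features such as chroma vectors. Everything below (the function names, the cosine frame cost, the greedy move selection, the drift window) is an illustrative assumption, not the authors' actual algorithm.

```python
import numpy as np

def frame_cost(a, b):
    """Cosine distance between two feature frames (e.g. 12-bin chroma vectors)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b)) / denom if denom > 0.0 else 1.0

def online_align(song_feats, video_feats, max_drift=200):
    """Incrementally extend a monotonic alignment path between two feature
    sequences, choosing at each step the cheapest of the three DTW-style
    moves (advance song, advance video, or advance both).

    song_feats, video_feats: (n_frames, n_dims) arrays of audio features.
    max_drift: give up if the path strays this many frames off the diagonal,
               e.g. because the recordings differ structurally.
    Returns a list of (song_frame, video_frame) index pairs.
    """
    i, j = 0, 0
    path = [(i, j)]
    while i + 1 < len(song_feats) and j + 1 < len(video_feats):
        moves = {
            (i + 1, j): frame_cost(song_feats[i + 1], video_feats[j]),
            (i, j + 1): frame_cost(song_feats[i], video_feats[j + 1]),
            (i + 1, j + 1): frame_cost(song_feats[i + 1], video_feats[j + 1]),
        }
        i, j = min(moves, key=moves.get)
        path.append((i, j))
        if abs(i - j) > max_drift:
            break  # structural mismatch, e.g. an extra intro or ending
    return path
```

With a hypothetical hop size of 512 samples at 44.1 kHz (about 11.6 ms per frame), the most recent (song_frame, video_frame) pair converts directly into a playback offset, and a player could nudge the video clock whenever that offset exceeds a perceptual threshold on the order of the 100 ms figure reported in the abstract.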
