Automatic Music Transcription using Audio-Visual Fusion for Violin Practice in Home Environment

Violin practice at home, where a teacher is often unavailable, can benefit from automatic music transcription that provides feedback to the student. This paper describes a high-performance violin transcription system with three main contributions. First, since onset detection is an important but challenging task in the automatic transcription of pitched non-percussive music such as the violin, we propose an effective audio-only onset detection approach based on supervised learning that substantially outperforms state-of-the-art methods. Second, we introduce the visual modality, i.e., the bowing and fingering of violin playing, to infer onsets and thereby enhance audio-only onset detection. We devise automatic, real-time video processing algorithms to extract features indicative of onsets from bowing and fingering videos. Third, we evaluate state-of-the-art multimodal fusion techniques for combining the audio and visual modalities, and show that fusion significantly improves onset detection and transcription performance. The resulting audio-visual transcription system provides more accurate transcriptions as learning feedback, even in acoustically poor environments. With efficient, fully automatic audio-visual analysis components, the system can be easily deployed in a home environment.
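The fusion idea described above can be illustrated with a minimal sketch. The paper does not specify its fusion rule here, so the example below assumes a simple weighted late-fusion scheme: each modality yields a frame-aligned onset detection function (ODF), the two are normalized and combined with modality weights, and onsets are picked as peaks above a threshold. Function names and weights are illustrative, not the authors' implementation.

```python
import numpy as np

def fuse_onset_functions(audio_odf, visual_odf, w_audio=0.6, w_visual=0.4):
    """Weighted late fusion of two frame-aligned onset detection functions.

    Each ODF is min-max normalized to [0, 1] before combining, so the
    weights express the relative trust placed in each modality
    (e.g. lowering w_audio in acoustically noisy rooms).
    """
    def normalize(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return w_audio * normalize(audio_odf) + w_visual * normalize(visual_odf)

def pick_onsets(odf, threshold=0.5):
    """Return frame indices that are local maxima above the threshold."""
    odf = np.asarray(odf)
    peaks = []
    for i in range(1, len(odf) - 1):
        if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]:
            peaks.append(i)
    return peaks

# Toy ODFs: the audio stream shows two candidate onsets; the visual
# stream (bowing/fingering activity) confirms only the first strongly.
fused = fuse_onset_functions(
    [0, 0.1, 0.9, 0.1, 0, 0.2, 0.8, 0.1],
    [0, 0.2, 0.8, 0.1, 0, 0.0, 0.1, 0.0],
)
print(pick_onsets(fused))  # frames where both modalities agree score highest
```

In practice the audio ODF would come from the supervised detector and the visual ODF from the bowing/fingering features; the fusion weights could be tuned on a validation set, which is one way the visual modality compensates for poor acoustics.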
