Automatic Music Transcription using Audio-Visual Fusion for Violin Practice in Home Environment

Violin practice at home, where a teacher is often unavailable, can benefit from automatic music transcription that provides feedback to the student. This paper describes a high-performance violin transcription system with three main contributions. First, since onset detection is an important but challenging task in the automatic transcription of pitched non-percussive music such as the violin, we propose an effective audio-only onset detection approach based on supervised learning that substantially outperforms state-of-the-art methods. Second, we introduce the visual modality, i.e., the bowing and fingering of violin playing, to infer onsets and thereby enhance audio-only onset detection. We devise automatic, real-time video processing algorithms to extract features indicative of onsets from bowing and fingering videos. Third, we evaluate state-of-the-art multimodal fusion techniques for combining the audio and visual modalities, and show that fusion significantly improves onset detection and transcription performance. The resulting audio-visual transcription system provides more accurate transcriptions as learning feedback, even in acoustically poor environments. With efficient, fully automatic audio-visual analysis components, the system can be easily deployed in a home environment.
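The fusion idea described above can be illustrated with a minimal sketch. The paper does not specify its fusion rule here, so the example below assumes a simple weighted late-fusion scheme: each modality yields a frame-aligned onset detection function (ODF), the two are normalized and combined with modality weights, and onsets are picked as peaks above a threshold. Function names and weights are illustrative, not the authors' implementation.

```python
import numpy as np

def fuse_onset_functions(audio_odf, visual_odf, w_audio=0.6, w_visual=0.4):
    """Weighted late fusion of two frame-aligned onset detection functions.

    Each ODF is min-max normalized to [0, 1] before combining, so the
    weights express the relative trust placed in each modality
    (e.g. lowering w_audio in acoustically noisy rooms).
    """
    def normalize(x):
        x = np.asarray(x, dtype=float)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    return w_audio * normalize(audio_odf) + w_visual * normalize(visual_odf)

def pick_onsets(odf, threshold=0.5):
    """Return frame indices that are local maxima above the threshold."""
    odf = np.asarray(odf)
    peaks = []
    for i in range(1, len(odf) - 1):
        if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]:
            peaks.append(i)
    return peaks

# Toy ODFs: the audio stream shows two candidate onsets; the visual
# stream (bowing/fingering activity) confirms only the first strongly.
fused = fuse_onset_functions(
    [0, 0.1, 0.9, 0.1, 0, 0.2, 0.8, 0.1],
    [0, 0.2, 0.8, 0.1, 0, 0.0, 0.1, 0.0],
)
print(pick_onsets(fused))  # frames where both modalities agree score highest
```

In practice the audio ODF would come from the supervised detector and the visual ODF from the bowing/fingering features; the fusion weights could be tuned on a validation set, which is one way the visual modality compensates for poor acoustics.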
