Educational violin transcription by fusing multimedia streams

Computer-assisted violin tutoring requires accurate violin transcription. For pitched non-percussive (PNP) sounds such as those produced by the violin, note segmentation is a much more difficult task than pitch detection. The problem is accentuated when the audio is recorded during an instrument practice session at home, an environment acoustically inferior to a professional recording studio. This paper presents a new approach to the problem that exploits the correlation between different media streams for e-learning applications. We design a capture mechanism to record one audio stream and two video streams simultaneously, and exploit the relationships among them for enhanced transcription. State-of-the-art audio methods for note segmentation and pitch estimation are implemented as the audio-only baseline. Two web cameras track the right hand (bowing) and the left hand's four fingers (fingering) on the fingerboard, respectively. The audio and visual information is then fused in the feature space. The new approach is evaluated on an audio-visual violin music database containing 16 complete music pieces of different styles with 2157 notes in total. Experimental results show that, compared with the audio-only baseline, the multimodal approach achieves a 10% increase in true positives and an 8% reduction in false positives in overall transcription performance.
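As a hedged illustration of the feature-space fusion described above (a minimal sketch, not the authors' implementation), per-frame audio features and visual bowing/fingering features can be concatenated into a single feature vector before note segmentation; all function names, dimensions, and the normalization choice below are hypothetical:

```python
import numpy as np

def fuse_features(audio_feats, bow_feats, finger_feats):
    """Feature-level fusion: concatenate per-frame audio and visual features.

    audio_feats:  (T, Da) array, e.g. spectral/pitch descriptors per frame
    bow_feats:    (T, Db) array, e.g. right-hand bow motion from video
    finger_feats: (T, Df) array, e.g. left-hand fingertip positions on the
                  fingerboard
    All streams are assumed already resampled to a common frame rate T.
    """
    fused = np.concatenate([audio_feats, bow_feats, finger_feats], axis=1)
    # Per-dimension z-score normalization so no single stream dominates
    # the fused representation.
    mu = fused.mean(axis=0)
    sigma = fused.std(axis=0) + 1e-8
    return (fused - mu) / sigma

# Toy usage: 5 frames, 3 audio dims, 1 bow dim, 4 finger dims.
T = 5
rng = np.random.default_rng(0)
fused = fuse_features(rng.random((T, 3)), rng.random((T, 1)), rng.random((T, 4)))
print(fused.shape)  # (5, 8)
```

A fused frame classifier (or onset detector) then operates on the 8-dimensional vectors instead of audio features alone, which is one plausible way the visual streams could reduce false positives.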
