Voice-Melody Transcription Under a Speech Recognition Framework

This paper presents a robust voice-melody transcription system built on a speech recognition framework. Many previous voice-melody transcription systems have used non-statistical approaches, but statistical recognition technology can potentially achieve more robust results. A cepstrum-based acoustic model avoids the hard decisions required by explicit voiced-unvoiced segmentation and pitch extraction, and a key-independent 4-gram language model captures the prior probabilities of different melodic sequences. The system is evaluated on both note recognition error rate and end-to-end query-by-humming performance, and the results are compared with three other voice-melody transcription systems. Experiments show that our system is state-of-the-art: it is much more robust than the other systems on data containing noise, and close to the best of all the systems on the clean data set.
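
To make the idea of a key-independent 4-gram model concrete, the following minimal sketch counts 4-grams over pitch intervals rather than absolute pitches, so that the same tune sung in two different keys yields identical statistics. The abstract does not say how key independence is achieved in the actual system; the interval representation, the MIDI note numbers, and the helper names `to_intervals` and `ngram_counts` are all assumptions made for illustration.

```python
from collections import Counter


def to_intervals(midi_notes):
    """Convert absolute MIDI pitches to successive semitone differences.

    Assumption: key independence is modeled here via intervals; the
    paper's actual representation is not given in the abstract.
    """
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]


def ngram_counts(symbols, n=4):
    """Count every contiguous n-gram in a symbol sequence."""
    return Counter(
        tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)
    )


# Hypothetical melody, written once in C and once transposed up a fifth.
melody_c = [60, 62, 64, 65, 67, 65, 64, 62, 60]
melody_g = [pitch + 7 for pitch in melody_c]

counts_c = ngram_counts(to_intervals(melody_c))
counts_g = ngram_counts(to_intervals(melody_g))

# Transposition leaves the interval 4-gram counts unchanged.
print(counts_c == counts_g)
```

In a full language model these counts would be normalized and smoothed into conditional probabilities; the sketch stops at raw counts to keep the key-independence property visible.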
