Singing voice recognition based on matching of spectrogram pattern

Singing voice recognition is a difficult topic in the music information retrieval research area. The first approaches borrowed successful techniques widely used in automatic speech recognition (ASR), as speech and singing share similar acoustical features, being produced by the same apparatus. Moving from monophonic to polyphonic audio signals, the problem becomes more complex, as the background instrumental accompaniment is regarded as a noise source that has to be attenuated. This paper proposes a singing voice recognition algorithm that automatically recognizes the words in a singing signal with background music by using the concept of spectrogram pattern matching. The main idea is to apply both the spectrogram and image processing methods to solve the problem of singing voice recognition. Each sung signal with musical accompaniment is analyzed and converted to its spectrogram, which is used as training data for the classifier. Several classification functions are compared, such as the Fisher classifier and a feed-forward neural network. The reported results show that words in music can be recognized effectively, with an accuracy rate of more than 84%.
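
The pipeline described in the abstract (signal, then spectrogram, then classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy signals, the sample rate, the frame sizes, the regularization constant, and all function names are assumptions made for illustration. Each "word" is stood in for by a tone at a fixed frequency mixed with a synthetic accompaniment and noise, the log-magnitude spectrogram is flattened like an image, and a two-class Fisher linear discriminant is trained on the result.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz; assumed value for the toy signals


def spectrogram(signal, frame_len=64, hop=64):
    """Log-magnitude spectrogram, shape (frequency bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1))).T


def make_example(word_freq, rng, n_samples=1024):
    """Hypothetical 'sung word': a tone plus accompaniment and noise."""
    t = np.arange(n_samples) / SAMPLE_RATE
    voice = np.sin(2 * np.pi * word_freq * t + rng.uniform(0, 2 * np.pi))
    accompaniment = 0.5 * np.sin(2 * np.pi * 200.0 * t + rng.uniform(0, 2 * np.pi))
    noise = 0.3 * rng.standard_normal(n_samples)
    return voice + accompaniment + noise


def make_dataset(n_per_class, rng):
    """Flatten each spectrogram into one feature vector per example."""
    features, labels = [], []
    for label, freq in enumerate((440.0, 880.0)):  # two toy 'words'
        for _ in range(n_per_class):
            features.append(spectrogram(make_example(freq, rng)).ravel())
            labels.append(label)
    return np.array(features), np.array(labels)


def fit_fisher(X, y, reg=1e-2):
    """Two-class Fisher discriminant: projection vector and threshold."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter, regularized because features outnumber samples.
    Sw = (len(X0) - 1) * np.cov(X0, rowvar=False)
    Sw += (len(X1) - 1) * np.cov(X1, rowvar=False)
    Sw += reg * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, m1 - m0)
    threshold = 0.5 * (w @ m0 + w @ m1)
    return w, threshold


def predict(X, w, threshold):
    """Assign class 1 when the projection exceeds the threshold."""
    return (X @ w > threshold).astype(int)


rng = np.random.default_rng(0)
X_train, y_train = make_dataset(20, rng)
X_test, y_test = make_dataset(20, rng)
w, threshold = fit_fisher(X_train, y_train)
accuracy = float(np.mean(predict(X_test, w, threshold) == y_test))
print(f"toy accuracy: {accuracy:.2f}")
```

Real singing is far less separable than these toy tones, so the accuracy printed here says nothing about the figure reported in the abstract; the sketch only shows how a spectrogram can serve as the training input to a classifier.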
