Recognition Of Phonemes In A-Cappella Recordings Using Temporal Patterns And Mel Frequency Cepstral Coefficients

In this paper, a new method for recognizing phonemes in singing is proposed. Unlike regular speech recognition, phoneme recognition in singing has not yet matured into a standardized method. Standard speech-recognition methods have been evaluated on sung recordings, but their performance is lower than on regular speech. Two alternative classification methods addressing this issue are proposed: one uses Mel-Frequency Cepstral Coefficient (MFCC) features, while the other uses Temporal Patterns (TRAPs). The two are combined into a new classifier that outperforms either one alone. The classification experiments are conducted on US English songs. The preliminary result is an average phoneme recall rate of 48.01% over all audio frames within a song.
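The abstract describes combining the outputs of an MFCC-based and a TRAP-based classifier. A minimal sketch of one such combination is late fusion by averaging frame-wise phoneme posteriors; note that the paper's exact combination rule is not stated here, so the averaging step, the toy posterior arrays, and the labels below are all illustrative assumptions:

```python
import numpy as np

# Hypothetical frame-wise phoneme posteriors from two classifiers,
# shaped (n_frames, n_phonemes). In the paper one classifier is fed
# MFCC features and the other Temporal Patterns (TRAPs); random
# Dirichlet rows stand in for their outputs here.
rng = np.random.default_rng(0)
n_frames, n_phonemes = 6, 3
p_mfcc = rng.dirichlet(np.ones(n_phonemes), size=n_frames)
p_trap = rng.dirichlet(np.ones(n_phonemes), size=n_frames)

# Late fusion by averaging the posteriors (an assumed scheme),
# followed by a per-frame argmax phoneme decision.
p_combined = 0.5 * (p_mfcc + p_trap)
pred = p_combined.argmax(axis=1)

# Toy ground-truth phoneme labels for the same frames.
truth = np.array([0, 1, 2, 0, 1, 2])

# Average per-phoneme recall over all frames of the "song",
# mirroring the frame-level recall metric reported in the abstract.
recalls = [(pred[truth == c] == c).mean() for c in range(n_phonemes)]
avg_recall = float(np.mean(recalls))
```

Averaging keeps each fused row a valid probability distribution, so the same decision rule (argmax) applies before and after fusion.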
