The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech

Despite a long-standing effort to characterize various aspects of the singing voice and their relation to speech, the lack of a suitable, publicly available dataset has precluded any systematic study of the quantitative differences between singing and speech at the phone level. We hereby present the NUS Sung and Spoken Lyrics Corpus (NUS-48E corpus) as the first step toward a large, phonetically annotated corpus for singing voice research. The corpus is a 169-minute collection of audio recordings of the sung and spoken lyrics of 48 (20 unique) English songs by 12 subjects, together with a complete set of transcriptions and phone-level duration annotations for all recordings of sung lyrics, comprising 25,474 phone instances. Using the NUS-48E corpus, we conducted a preliminary quantitative comparison of singing voice and speech. The study includes duration analyses of the sung and spoken lyrics, with a primary focus on the behavior of consonants, and experiments that gauge how the acoustic representations of spoken and sung phonemes differ, as well as how duration and pitch variations affect Mel-frequency cepstral coefficient (MFCC) features.
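
To make the MFCC-based comparison concrete, the sketch below shows one plausible way to contrast a sung and a spoken rendition of the same lyrics. It is a minimal illustration, not the paper's exact pipeline: the file names are hypothetical, and the alignment and distance measure (dynamic time warping over 13 MFCCs via librosa) are assumptions chosen to mirror the kind of experiment the abstract describes.

```python
# Minimal sketch: compare MFCC features of a sung and a spoken rendition of
# the same lyrics. File paths are hypothetical placeholders.
import librosa
import numpy as np

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load audio at a fixed sample rate and return an (n_mfcc, n_frames) MFCC matrix."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

sung = mfcc_features("sung_lyrics.wav")      # hypothetical sung recording
spoken = mfcc_features("spoken_lyrics.wav")  # hypothetical spoken recording

# Sung and spoken renditions differ greatly in duration, so align the two
# MFCC sequences with dynamic time warping before comparing frames.
cost, warp_path = librosa.sequence.dtw(X=sung, Y=spoken, metric="euclidean")

# Average frame-wise Euclidean distance along the optimal warping path gives
# a rough measure of how far apart the sung and spoken spectral envelopes are.
dists = [np.linalg.norm(sung[:, i] - spoken[:, j]) for i, j in warp_path]
print(f"Mean aligned MFCC distance: {np.mean(dists):.2f}")
```

With phone-level annotations such as those in NUS-48E, the same comparison could be restricted to matched phone segments instead of whole utterances, which is closer in spirit to the phone-level analyses the abstract mentions.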
