Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System

Automatic sung speech recognition is a relatively understudied topic that has been held back by a lack of large, freely available datasets. This has recently changed thanks to the release of the DAMP Sing! dataset, a 1100-hour karaoke dataset originating from the social music-making company Smule. This paper presents work undertaken to define an easily replicable automatic speech recognition benchmark for this data. In particular, we describe how transcripts and alignments have been recovered from karaoke prompts and timings; how suitable training, development and test sets have been defined with varying degrees of accent variability; and how language models have been developed using lyric data from the LyricWikia website. Initial recognition experiments have been performed using factored-layer TDNN acoustic models with lattice-free MMI training in Kaldi. The best WER is 19.60%, a new state of the art for this type of data. The paper concludes with a discussion of the many challenging problems that remain to be solved. Dataset definitions and Kaldi scripts have been made available so that the benchmark is easily replicable.
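
A key preparation step mentioned above is turning the recovered karaoke prompt timings into utterance-level training data. The sketch below shows what such a conversion could look like targeting Kaldi's standard data-directory format (text, segments, utt2spk); it is a minimal illustration only, not the released pipeline, and the input tuples, identifiers, and paths are hypothetical, assuming each prompt line has already been recovered as a (start, end, text) triple.

```python
# Minimal sketch: write karaoke prompt timings as Kaldi data files.
# Assumes each performance yields per-line (start_sec, end_sec, text)
# triples recovered from the karaoke prompts; the identifiers and the
# input layout used here are hypothetical, not the actual DAMP format.
import os

def write_kaldi_data(timings, rec_id, spk_id, out_dir):
    """timings: list of (start, end, text); appends to text/segments/utt2spk."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "text"), "a") as txt, \
         open(os.path.join(out_dir, "segments"), "a") as seg, \
         open(os.path.join(out_dir, "utt2spk"), "a") as u2s:
        for i, (start, end, words) in enumerate(timings):
            # Kaldi convention: prefix utterance IDs with the speaker ID.
            utt_id = f"{spk_id}-{rec_id}-{i:04d}"
            txt.write(f"{utt_id} {words.upper()}\n")   # uppercased for illustration
            seg.write(f"{utt_id} {rec_id} {start:.2f} {end:.2f}\n")
            u2s.write(f"{utt_id} {spk_id}\n")

# Example: two prompt lines from one performance.
write_kaldi_data(
    [(12.40, 15.85, "never gonna give you up"),
     (15.85, 19.10, "never gonna let you down")],
    rec_id="perf0001", spk_id="spk001", out_dir="data/train")
```

The resulting directory can then be fed to the usual Kaldi feature-extraction and training stages; the actual segmentation and normalisation decisions are documented in the released scripts.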
