Efficient Lyrics Retrieval and Alignment

We present an algorithm that efficiently retrieves multiple versions of the lyrics of a given song from the Web. First, we query Google with the song title and artist name to collect web pages that potentially contain the lyrics. Next, from each of these pages, we extract the part that most likely contains the lyrics by making explicit use of the structural properties of lyrics. In addition, we present an efficient approximation algorithm to align the multiple lyrics versions. Multiple sequence alignment is known to be NP-hard, and our approximation algorithm is much more efficient than an algorithm previously proposed in the literature for this application. Results on a set of 258 songs show that our approach extracts relevant lyrics for 97% of them.
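To make the alignment task concrete, the following is a minimal sketch of aligning two lyrics versions word by word with standard edit-distance dynamic programming. It is not the approximation algorithm proposed here: the function name `align_pair`, the `GAP` marker, and the unit costs are assumptions chosen for illustration. One simple way to extend such a pairwise step to many versions, again an assumption and not necessarily the approach taken in this work, is to align each version against a single reference version.

```python
from typing import List, Optional, Tuple

GAP = None  # hypothetical marker: one version has no word at this position


def align_pair(a: List[str], b: List[str]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Globally align two word sequences, assuming unit costs for
    substitution, insertion, and deletion (an illustrative choice)."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimal edit cost between a[:i] and b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], GAP))  # word present only in a
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))  # word present only in b
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    version_1 = "hello darkness my old friend".split()
    version_2 = "hello darkness my friend".split()
    for left, right in align_pair(version_1, version_2):
        print(left, right)
```

This pairwise step takes time proportional to the product of the two sequence lengths. Computing an optimal alignment of many versions at once is what makes the general problem NP-hard, which is why an approximation is needed.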