Efficient Lyrics Extraction from the Web

We present a novel method to extract lyrics from the Web. The aim is to extract a set of multiple versions of the lyrics to a song. Lyrics can be identified within a text by a regular expression. We use a projection of a document to efficiently identify lyrics within the document by mapping it to a regular expression. We describe a method to cluster the multiple versions of the lyrics by filtering out erroneous texts such as lyrics to other songs. For reasons of efficiency, we do this by comparing fingerprints instead of the texts themselves.