Efficient Lyrics Extraction from the Web
We present a novel method for extracting lyrics from the Web. The aim is to retrieve a set of multiple versions of the lyrics to a given song. Lyrics within a text can be described by a regular expression. To identify them efficiently, we map each document to a projection and match that projection against the regular expression. We then describe a method to cluster the multiple versions of the lyrics, filtering out erroneous texts such as lyrics to other songs. For efficiency, we do this by comparing fingerprints of the texts instead of the texts themselves.
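The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the projection, the regular expression, or the fingerprint, so everything here is an assumption chosen for illustration: the line classes `b`/`l`/`x`, the 60-character cutoff, the pattern `LYRICS_PATTERN`, the hashed word-trigram fingerprint, the Jaccard similarity, and the 0.5 threshold.

```python
import hashlib
import re


def project(text):
    """Project a document onto one symbol per line (assumed line classes):
    'b' = blank line, 'l' = short markup-free line, 'x' = anything else."""
    symbols = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            symbols.append("b")
        elif len(stripped) <= 60 and "<" not in stripped:
            symbols.append("l")
        else:
            symbols.append("x")
    return "".join(symbols)


# Hypothetical pattern: two or more stanzas of at least two short lines,
# separated by one or more blank lines.
LYRICS_PATTERN = re.compile(r"l{2,}(?:b+l{2,})+")


def extract_lyrics(text):
    """Return the first block of lines whose projection matches the pattern."""
    lines = text.splitlines()
    match = LYRICS_PATTERN.search(project(text))
    if match is None:
        return None
    # One symbol per line, so match offsets are line indices.
    return "\n".join(lines[match.start():match.end()])


def fingerprint(text, k=3):
    """Assumed fingerprint: the set of 32-bit hashes of word k-grams."""
    words = re.findall(r"[a-z']+", text.lower())
    shingles = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {int(hashlib.md5(s.encode("utf-8")).hexdigest()[:8], 16)
            for s in shingles}


def similarity(fp_a, fp_b):
    """Jaccard similarity of two fingerprints."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def filter_versions(candidates, threshold=0.5):
    """Keep the candidates close to the most 'central' candidate,
    dropping outliers such as lyrics to other songs."""
    fps = [fingerprint(c) for c in candidates]
    scores = [sum(similarity(f, g) for g in fps) for f in fps]
    reference = fps[scores.index(max(scores))]
    return [c for c, f in zip(candidates, fps)
            if similarity(f, reference) >= threshold]
```

Matching against the projection rather than the raw text keeps the regular expression short, and comparing sets of small integers avoids repeated full-text comparisons between candidate versions. A real system would need a more careful choice of line classes and thresholds than the ones assumed here.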