DETECTING WEBSITE REDESIGNS VIA TEMPLATE SIMILARITY ON STREAMS OF DOCUMENTS

Most websites undergo a redesign from time to time. Along with the change in a site's appearance comes a different document structure. Hence, redesigns can be detected by observing changes in the structural similarity of monitored HTML documents. If, in addition, we monitor not a fixed document set but a series of the newest documents (e.g., as provided by an RSS feed), redesign detection becomes a particular change detection operation on streams of documents. This paper describes and evaluates one simple approach and three more elaborate approaches to the problem. We show that redesigns can be detected automatically, effectively, and efficiently.
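To make the general idea concrete, the following minimal sketch flags a redesign when a new document in the stream is structurally dissimilar to the documents that preceded it. It is an illustration only and not one of the approaches evaluated in the paper: the similarity measure (Jaccard overlap of tag bigrams), the sliding window of three documents, the threshold of 0.5, and all function names are assumptions chosen for the example.

```python
from collections import deque
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the sequence of opening tag names, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, k=2):
    """Return the set of k-grams over the document's opening-tag sequence."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def detect_redesigns(stream, window=3, threshold=0.5):
    """Yield the index of each document whose mean structural similarity
    to the previous `window` documents falls below `threshold`.

    Both parameters are assumed values for illustration.
    """
    recent = deque(maxlen=window)
    for index, html in enumerate(stream):
        shingles = tag_shingles(html)
        if len(recent) == window:
            mean_sim = sum(jaccard(shingles, r) for r in recent) / window
            if mean_sim < threshold:
                yield index
                # Start a fresh reference window for the new design.
                recent.clear()
        recent.append(shingles)
```

Because only tag names enter the comparison, documents that share a template but differ in text are treated as identical, while a switch from, say, a table-based layout to a div-based layout lowers the similarity sharply. Clearing the window after a detection means the first documents of the new design become the new reference, so a single redesign is reported once.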