Identifying Clones in Dynamic Web Sites Using Similarity Thresholds

We propose an approach to automatically detect duplicated pages in dynamic Web sites and on the analysis of both the page structure, implemented by specific sequences of HTML tags, and the displayed content. In addition, for each pair of dynamic pages we also consider the similarity degree of their scripting code. The similarity degree of two pages is computed using different similarity metrics for the different parts of a web page based on the Levenshtein string edit distance. We have implemented a prototype to automate the clone detection process on web applications developed using JSP technology and used it to validate our approach

[1]  Shinji Kusumoto,et al.  CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code , 2002, IEEE Trans. Software Eng..

[2]  Magdalena Balazinska,et al.  Measuring clone based reengineering opportunities , 1999, Proceedings Sixth International Software Metrics Symposium (Cat. No.PR00403).

[3]  Cornelia Boldyreff,et al.  The evolution of Websites , 1999, Proceedings Seventh International Workshop on Program Comprehension.

[4]  Paolo Tonella,et al.  Using clustering to support the migration from static to dynamic web pages , 2003, 11th IEEE International Workshop on Program Comprehension, 2003..

[5]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[6]  Massimiliano Di Penta,et al.  An approach to identify duplicated web pages , 2002, Proceedings 26th Annual International Computer Software and Applications.

[7]  Lerina Aversano,et al.  Web site reuse: cloning and adapting , 2001, Proceedings 3rd International Workshop on Web Site Evolution. WSE 2001.

[8]  Brenda S. Baker,et al.  On finding duplication and near-duplication in large software systems , 1995, Proceedings of 2nd Working Conference on Reverse Engineering.

[9]  Paolo Tonella,et al.  Understanding and Restructuring Web Sites with ReWeb , 2001, IEEE Multim..

[10]  Filippo Lanubile,et al.  Finding function clones in Web applications , 2003, Seventh European Conference onSoftware Maintenance and Reengineering, 2003. Proceedings..