An approach to identify duplicated web pages

A relevant consequence of the expansion of the web and e-commerce is the growing demand for new web sites and web applications. As a result, web sites and applications are usually developed without a formalized process, and web pages are coded directly and incrementally, with new pages obtained by duplicating existing ones. Duplicated web pages, which share the same structure and differ only in the data they include, can be considered clones. Identifying clones may reduce the effort devoted to testing, maintaining, and evolving web sites and applications. Moreover, clone detection across different web sites can reveal cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in web sites and applications implemented with HTML and ASP technology. The proposed approach has been assessed by analyzing several web sites and web applications. The results obtained are reported in the paper for a number of case studies.
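
As a rough illustration of what a structure-based similarity metric might look like, the sketch below reduces each HTML page to its sequence of tags (discarding text and attribute values) and compares two sequences with the Levenshtein edit distance. This is an assumption made for illustration only: the abstract does not state which metrics the approach uses, and the names `tag_sequence`, `edit_distance`, `are_structural_clones`, and the `max_distance` threshold are hypothetical.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening and closing tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the page structure as a list of tag names (hypothetical helper)."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def are_structural_clones(html_a, html_b, max_distance=0):
    """Assumed criterion: pages are clones if their tag sequences are
    within max_distance edits of each other (0 means identical structure)."""
    distance = edit_distance(tag_sequence(html_a), tag_sequence(html_b))
    return distance <= max_distance
```

Under this sketch, two product pages generated from the same template but showing different names and prices would have a distance of 0 and be reported as clones, while a page with a different layout would have a positive distance. A non-zero `max_distance` would allow near-duplicates to be reported as well.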
