SSUP - A URL-Based Method to Entity-Page Discovery

Entity-pages are Web pages that publish data representing a single instance of a certain conceptual entity. In this paper we propose SSUP, a new method for entity-page discovery. Specifically, given a sample entity-page from a Web site (e.g., the Jolyon Palmer entity-page from the GP2 Web site), we aim to find all entity-pages of the same type (driver entity-pages) on that Web site. We propose two structural URL similarity metrics and a set of algorithms that combine URL features with HTML features in order to improve the quality of the results and to minimize the number of downloaded pages and the processing time. We evaluate our method on real-world Web sites and compare it with two baselines to demonstrate its effectiveness.
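
To make the idea of a structural URL similarity concrete, the following is a minimal illustrative sketch in Python. It is not one of the two SSUP metrics, which this abstract does not define. It assumes a simple scheme in which URLs are split into host and path segments and compared position by position. The function names (`url_tokens`, `token_shape`, `structural_similarity`), the 0.5 weight for same-shape segments, and the `example.com` URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc.lower()] + segments


def token_shape(token):
    """Coarse character class of a segment (an assumption of this sketch)."""
    if token.isdigit():
        return "digits"
    if token.isalpha():
        return "letters"
    return "mixed"


def structural_similarity(url_a, url_b):
    """Score in [0, 1]: 1.0 per identical segment, 0.5 per same-shape segment.

    URLs with a different number of segments score 0.0, on the assumption
    that pages of the same type share the same URL depth.
    """
    a, b = url_tokens(url_a), url_tokens(url_b)
    if len(a) != len(b):
        return 0.0
    score = 0.0
    for ta, tb in zip(a, b):
        if ta == tb:
            score += 1.0
        elif token_shape(ta) == token_shape(tb):
            score += 0.5  # hypothetical weight for a structurally similar segment
    return score / len(a)


if __name__ == "__main__":
    sample = "http://example.com/drivers/jolyon-palmer"  # hypothetical sample page
    candidates = [
        "http://example.com/drivers/felipe-nasr",
        "http://example.com/teams/dams",
        "http://example.com/news/2014/race-report",
    ]
    # Rank candidate URLs by similarity to the sample before downloading any page.
    for url in sorted(candidates, key=lambda u: -structural_similarity(sample, u)):
        print(round(structural_similarity(sample, url), 3), url)
```

A score of this kind could be used to rank the links found on a site so that only URLs resembling the sample are downloaded and checked against HTML features, which is the kind of saving in downloads and processing time the abstract refers to.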