An Automated Algorithm for Extracting Website Skeleton

The huge amount of information available on the Web has attracted many research efforts into developing wrappers that extract data from webpages. However, because most wrapper-generation systems focus on extracting data at the page level, data extraction at the site level remains a manual or semi-automatic process. In this paper, we study the problem of extracting a website's skeleton, i.e., the underlying hyperlink structure that is used to organize the content pages of a given website. We propose an automated algorithm, called the Sew algorithm, to discover the skeleton of a website. Given a page, the algorithm examines hyperlinks in groups and identifies the navigation links that point to pages at the next level of the website structure. The entire skeleton is then constructed by recursively fetching the pages pointed to by the discovered links and analyzing them with the same process. Our experiments on real-life websites show that the algorithm achieves high recall with moderate precision.
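The recursive construction described above can be illustrated with a minimal sketch. It is not the Sew algorithm itself: the abstract does not state how a navigation group is chosen, so the `largest_group` selector below is a placeholder assumption, and the names `build_skeleton`, `EXAMPLE_SITE`, and `max_depth` are hypothetical. Pages are modeled as an in-memory mapping from a URL to the groups of hyperlinks found on that page, so no fetching or HTML parsing is involved.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical in-memory site: URL -> groups of hyperlinks found on that page.
Site = Dict[str, List[List[str]]]
Selector = Callable[[str, List[List[str]]], Optional[List[str]]]


def largest_group(url: str, groups: List[List[str]]) -> Optional[List[str]]:
    """Placeholder heuristic (an assumption, not the paper's criterion):
    treat the largest group with at least two links as the navigation group."""
    candidates = [g for g in groups if len(g) >= 2]
    return max(candidates, key=len) if candidates else None


def build_skeleton(site: Site, url: str, select: Selector = largest_group,
                   max_depth: int = 3, seen: Optional[Set[str]] = None) -> dict:
    """Return a tree {"url": ..., "children": [...]} rooted at `url`."""
    seen = set() if seen is None else seen
    seen.add(url)
    node = {"url": url, "children": []}
    if max_depth == 0:
        return node

    # Examine the page's hyperlinks in groups and keep one navigation group.
    nav = select(url, site.get(url, [])) or []

    # Claim all new links before recursing, so a navigation bar repeated on a
    # child page does not turn sibling pages into that child's descendants.
    new_links = [link for link in dict.fromkeys(nav) if link not in seen]
    seen.update(new_links)

    # Analyze each discovered page with the same process.
    for link in new_links:
        node["children"].append(
            build_skeleton(site, link, select, max_depth - 1, seen))
    return node


# Toy site used only for illustration.
EXAMPLE_SITE: Site = {
    "/": [["/a", "/b"], ["/login"]],
    "/a": [["/a", "/b"], ["/a/1", "/a/2", "/a/3"]],
    "/b": [["/a", "/b"]],
}

skeleton = build_skeleton(EXAMPLE_SITE, "/")
```

Passing the selector as a parameter keeps the traversal separate from the link-group analysis, which is the part a real implementation would need to replace with the paper's actual criterion.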
