Paper information - Web data extraction based on structural similarity

Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schema, it can be easily incorporated into existing systems as the fabric for integration.

Zhao Li | Wee Keong Ng | Aixin Sun