Web data extraction based on structural similarity

Web data-extraction systems in use today focus mainly on generating extraction rules, i.e., wrapper induction. As a result, they appear ad hoc and are difficult to integrate when a holistic view is taken: the phases of the data-extraction process are disconnected and share no common foundation that would make building a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training-set preparation, wrapper induction, and document classification) can be easily integrated. This yields improved efficiency and better control over the extraction procedure, which our experimental results confirm. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
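
To make the idea of representing a document as a vector over structural patterns concrete, the following is a minimal sketch and not the paper's method. It assumes, purely for illustration, that a "schema" can be approximated by a root-to-node tag path and that two documents can be compared by the cosine of their path-count vectors; the names `TagPathCollector`, `schema_vector`, and `cosine` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
import math

# Tags that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Counts root-to-node tag paths, an assumed stand-in for document schemata."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Count the path but do not push, since no end tag will follow.
            self.paths["/".join(self.stack + [tag])] += 1
            return
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def schema_vector(html):
    """Return a sparse vector (Counter) of tag-path counts for one document."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


page_a = "<html><body><table><tr><td>A</td><td>B</td></tr></table></body></html>"
page_b = "<html><body><table><tr><td>X</td><td>Y</td></tr></table></body></html>"
page_c = "<html><body><p>plain text</p></body></html>"

sim_ab = cosine(schema_vector(page_a), schema_vector(page_b))
sim_ac = cosine(schema_vector(page_a), schema_vector(page_c))
print(sim_ab, sim_ac)
```

Under these assumptions, two pages generated from the same template score higher than pages with different layouts, regardless of their text content. Such vectors could then feed a clustering or classification step, which is the kind of integration the abstract describes.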
