Exploiting structural similarity for effective Web information extraction

In this paper, we propose a technique for classifying Web pages based on the detection of structural similarities among semistructured documents, and we devise an architecture that exploits this technique for information extraction. The proposal differs significantly from standard methods based on graph-matching algorithms: it represents the structure of a document as a time series in which each occurrence of a tag corresponds to an impulse. The degree of similarity between two documents is then determined by comparing the frequencies of their Fourier transforms. Experiments on real data show the effectiveness of the proposed technique.
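
The idea of comparing documents in the frequency domain can be illustrated with a minimal sketch. It is not the encoding or distance used in the paper, whose details the abstract does not give. The sketch assumes that each distinct tag is assigned an integer code in order of first appearance, that each tag occurrence becomes an impulse of that height, and that two pages are compared by the Euclidean distance between the magnitudes of their discrete Fourier transforms. The names `TagSequence`, `encode`, `dft_magnitudes`, and `structural_distance` are hypothetical.

```python
import cmath
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def encode(html, codes):
    """Turn a page into a signal: one impulse per tag occurrence.

    Assumption: the impulse height is an integer code assigned to each
    distinct tag in order of first appearance (shared via `codes`).
    """
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return [float(codes.setdefault(tag, len(codes) + 1)) for tag in parser.tags]


def dft_magnitudes(signal, n):
    """Magnitudes of the n-point DFT of `signal`, zero-padded to length n."""
    padded = signal + [0.0] * (n - len(signal))
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(padded)))
        for k in range(n)
    ]


def structural_distance(html_a, html_b):
    """Euclidean distance between the DFT magnitude spectra of two pages."""
    codes = {}
    a = encode(html_a, codes)
    b = encode(html_b, codes)
    n = max(len(a), len(b), 1)
    fa = dft_magnitudes(a, n)
    fb = dft_magnitudes(b, n)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5


# Two pages with the same tag structure but different text, and one
# page with a different structure.
page_a = "<html><body><h1>News</h1><p>First story</p></body></html>"
page_b = "<html><body><h1>Sport</h1><p>Another story</p></body></html>"
page_c = "<html><body><table><tr><td>cell</td></tr></table></body></html>"

same = structural_distance(page_a, page_b)
different = structural_distance(page_a, page_c)
```

Because only tags contribute to the signal, `page_a` and `page_b` yield identical signals and a distance of zero, while `page_c` yields a strictly positive distance. A direct O(n^2) DFT is used here only to keep the sketch dependency-free; a fast Fourier transform would be the natural choice for pages of realistic size.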
