Paper Information - Automatic Data Extraction from Lists in Web Pages Based on XML


This paper proposes an XML-based method for automatically extracting information from web pages. The method exploits the structural similarity of the information generated from a web page template: it builds a DOM tree for the page and derives the record pattern of the web information automatically by analyzing the PathPattern of that tree. The whole process is fully automatic and requires neither sample collection nor manual labeling. Experiments were carried out to test the approach, and the results indicate that it is feasible.
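The abstract does not define PathPattern or describe how the record pattern is derived, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that a path pattern is the sequence of tags from the root to a text node, and that the path occurring most often marks the repeated list records. The names `PathCollector` and `extract_list_items` are hypothetical, and Python's standard `html.parser` stands in for a full XML/DOM toolchain.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append(("/".join(self.stack), text))


def extract_list_items(html):
    """Return the most frequent tag path and the texts found at that path.

    Assumption: the most frequent path corresponds to the list records.
    """
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(path for path, _ in collector.records)
    if not counts:
        return None, []
    best_path, _ = counts.most_common(1)[0]
    return best_path, [t for p, t in collector.records if p == best_path]


page = """
<html><body>
  <h1>Catalog</h1>
  <ul>
    <li><a href="/a">Alpha</a></li>
    <li><a href="/b">Beta</a></li>
    <li><a href="/c">Gamma</a></li>
  </ul>
</body></html>
"""
path, items = extract_list_items(page)
print(path)
print(items)
```

In this toy page the heading occurs once while the link texts share one path three times, so the repeated path is selected without any labeled samples. A real page would likely need a looser notion of similarity than exact path equality, since records produced from the same template can differ slightly in structure.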

Wang Hao | Zhou Xin
