Automatic Data Extraction from Lists in Web Pages Based on XML

This paper proposes an automatic, XML-based method for extracting information from web pages. The method exploits the structural similarity of information generated from the same page template: it builds a DOM tree for the page and analyzes the PathPattern of that tree to obtain the record pattern of the web information automatically. The whole process is fully automatic and requires neither sample collection nor manual labeling. Experiments were carried out to test the approach, and the results indicate that it is feasible.
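
As a rough illustration of the idea, the following minimal Python sketch counts root-to-element tag paths in a DOM tree and treats the shallowest repeated path as a candidate record pattern. It is not the paper's algorithm: the abstract does not define PathPattern, so reading it as a tag path is an assumption, and the function names (path_counts, candidate_record_path, extract_records), the min_repeats threshold, and the sample page are all hypothetical. The sketch also assumes the page is already well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def path_counts(root):
    """Count how often each root-to-element tag path occurs in the tree."""
    counts = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def candidate_record_path(counts, min_repeats=2):
    """Return the shallowest path seen at least min_repeats times (assumed heuristic)."""
    repeated = [p for p, n in counts.items() if n >= min_repeats]
    if not repeated:
        return None
    # Fewest path segments first; ties broken alphabetically for determinism.
    return min(repeated, key=lambda p: (p.count("/"), p))


def extract_records(root, record_path):
    """Collect the text of each child element for every node on record_path."""
    # Drop the leading empty segment and the root tag to get a relative path.
    relative = "./" + "/".join(record_path.split("/")[2:])
    return [[child.text for child in node] for node in root.findall(relative)]


# Hypothetical sample page: three list items produced by the same template.
PAGE = """<html><body>
<h1>Books</h1>
<ul>
<li><b>Title A</b><i>10</i></li>
<li><b>Title B</b><i>12</i></li>
<li><b>Title C</b><i>9</i></li>
</ul>
</body></html>"""

root = ET.fromstring(PAGE)
counts = path_counts(root)
record_path = candidate_record_path(counts)
records = extract_records(root, record_path)
print(record_path)
print(records)
```

No labeled samples are involved: the repeated path is found purely from the structure of the tree, which is the property the abstract emphasizes. Real pages would first need to be converted to well-formed XML, a step this sketch leaves out.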