Extracting Content from Web Pages Based on RSS
This paper proposes a new method for extracting content from Web pages, using RSS feeds as an index. The method first identifies, from the links in an RSS feed, a collection of Web pages with similar structure and derives their shared page template. It then computes features of the content blocks to obtain a template for the body text, and finally uses that template to extract content in batch from every page in the collection. The method is highly tolerant of faults in Web documents, and the results show that it achieves high accuracy and adapts to a wide range of pages.
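The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes the pages linked from one feed have already been downloaded as HTML strings (at least two of them), treats a tag path as a "content block", and uses a single simple feature: blocks whose text is identical on every page are taken to be template boilerplate, and among the remaining shared blocks the one with the longest average text is taken to be the body. All names (`BlockCollector`, `page_blocks`, `find_body_path`, `extract_all`) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIPPED = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect visible text keyed by tag path, e.g. 'html/body/div#content/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        ident = dict(attrs).get("id")
        self.stack.append(f"{tag}#{ident}" if ident else tag)

    def handle_endtag(self, tag):
        # Tolerate unbalanced markup: pop back to the nearest matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1].split("#")[0] in SKIPPED:
            return
        text = " ".join(data.split())
        if text:
            self.blocks["/".join(self.stack)].append(text)


def page_blocks(html):
    """Map each tag path in one page to the text found under it."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return {path: " ".join(parts) for path, parts in parser.blocks.items()}


def find_body_path(pages):
    """Assumed heuristic: the shared path whose text varies and is longest on average."""
    all_blocks = [page_blocks(p) for p in pages]
    shared = set(all_blocks[0]).intersection(*all_blocks[1:])
    best, best_score = None, 0.0
    for path in shared:
        texts = [blocks[path] for blocks in all_blocks]
        if len(set(texts)) == 1:
            continue  # identical on every page: template boilerplate
        score = sum(len(t) for t in texts) / len(texts)
        if score > best_score:
            best, best_score = path, score
    return best


def extract_all(pages):
    """Batch extraction: apply the learned body path to every page."""
    path = find_body_path(pages)
    return [page_blocks(p).get(path, "") for p in pages]
```

Calling `extract_all(pages)` on the pages of one feed returns one body string per page. A practical system would need richer block features than text length, for example link density or punctuation counts, but those choices are outside what the abstract specifies.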