Extracting Content from Web Pages Based on RSS

This paper proposes a new method for extracting content from Web pages that uses RSS as an index. The method first uses the RSS feed to discover a collection of structurally similar Web pages and finds the page template they share. It then computes features of the content blocks to obtain the body template, and finally applies that template to extract content in batch from the pages in the collection. The method is highly tolerant of faults in Web documents, and the results show that it achieves high accuracy and adapts to a wide range of pages.
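
The pipeline summarized above can be illustrated with a minimal sketch. It is not the algorithm evaluated in this paper: it assumes that a content block can be identified by its tag path, that the page template is the set of tag paths shared by every page in the feed, and that the body block is the shared path whose text differs between pages and is longest on average. All names (`feed_links`, `BlockCollector`, `page_blocks`, `body_path`, `extract`), the sample feed, and the sample pages are hypothetical, and the pages are given inline rather than downloaded.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

def feed_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]

class BlockCollector(HTMLParser):
    """Collect visible text keyed by the tag path that encloses it."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            a = dict(attrs)
            ident = a.get("id") or a.get("class") or ""
            self.stack.append(f"{tag}#{ident}" if ident else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1].split("#")[0] not in ("script", "style"):
            self.blocks["/".join(self.stack)].append(text)

def page_blocks(html):
    """Map each tag path of a page to the text found under it."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return {path: " ".join(parts) for path, parts in parser.blocks.items()}

def body_path(pages):
    """Assumed heuristic: the shared path whose text varies and is longest on average."""
    per_page = [page_blocks(p) for p in pages]
    shared = set(per_page[0]).intersection(*per_page[1:])  # assumed "page template"
    best, best_score = None, 0.0
    for path in shared:
        texts = [blocks[path] for blocks in per_page]
        if len(set(texts)) == 1:  # identical on every page: navigation, footer
            continue
        score = sum(len(t) for t in texts) / len(texts)
        if score > best_score:
            best, best_score = path, score
    return best

def extract(html, path):
    """Apply the body path to one page."""
    return page_blocks(html).get(path, "")

# Hypothetical sample data standing in for a real feed and downloaded pages.
RSS = """<rss version="2.0"><channel><title>Demo</title>
<item><title>A</title><link>http://example.com/a</link></item>
<item><title>B</title><link>http://example.com/b</link></item>
</channel></rss>"""

PAGES = {
    "http://example.com/a": "<html><body><div id='nav'>Home | News</div>"
        "<h1>First story</h1><div id='content'><p>The first article has a fairly "
        "long body text.</p><p>It spans two paragraphs.</p></div>"
        "<div id='footer'>Copyright Demo</div></body></html>",
    "http://example.com/b": "<html><body><div id='nav'>Home | News</div>"
        "<h1>Second story</h1><div id='content'><p>The second article talks about "
        "something else entirely.</p></div>"
        "<div id='footer'>Copyright Demo</div></body></html>",
}

pages = [PAGES[url] for url in feed_links(RSS)]
template = body_path(pages)
bodies = {url: extract(PAGES[url], template) for url in PAGES}
```

In this sketch the navigation and footer blocks are discarded because their text is identical on both pages, and the heading loses to the article paragraphs on average length. A real feed would need more pages and richer block features than text length alone.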