Web Page Main Text Extraction Based on Content Similarity

This paper proposes a method of simplifying complex Web page script and mapping it into tree structure easy to operate. It does not depend on DOM tree, and does not need utilize htmlparser bag to parse. By calculating text similarity, it calculates the similarity between the content of tree node and headings of different levels to determine the usefulness of the text information, cleans the Web page and extracts the content information. Experimental results show that the method has better universal property and accuracy rate in main text extraction.