Content extraction of Web pages based on characteristic symbols

With the growing popularity of the Internet, the large amount of data on the Web poses many challenges for data mining techniques, especially for content extraction from Web pages. Existing methods cannot guarantee both the generality and the effectiveness of Web mining approaches. By studying the internal structure of Web pages, this paper proposes an improved document tree model and derives general rules for analyzing it. Content is then extracted from Web pages based on characteristic symbols. The experimental results show that the proposed method is both accurate and general.
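
The abstract does not define the characteristic symbols or the improved document tree model, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes that characteristic symbols are sentence punctuation marks, which tend to be frequent in body text and rare in navigation links, and it replaces the document tree model with a flat list of block-level text segments. The names `SYMBOLS`, `BLOCK_TAGS`, `BlockCollector`, `extract_content`, and the `min_symbols` threshold are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: characteristic symbols are common sentence punctuation marks.
SYMBOLS = set(",.;:!?，。；：！？、")
# Assumption: these tags delimit candidate text blocks.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
# Text inside these tags is never page content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects the text of a page as a flat list of block-level segments."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0

    def _flush(self):
        # Join buffered text, normalize whitespace, and store it if non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_symbols=2):
    """Keep blocks containing at least min_symbols characteristic symbols."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [
        block
        for block in parser.blocks
        if sum(ch in SYMBOLS for ch in block) >= min_symbols
    ]


if __name__ == "__main__":
    page = (
        '<html><body><div><a href="/">Home</a> <a href="/n">News</a></div>'
        "<p>The Web is large, noisy, and hard to mine. "
        "Content extraction helps.</p>"
        '<script>var a = "x, y, z.";</script></body></html>'
    )
    for block in extract_content(page):
        print(block)
```

In this sketch the navigation block contains no punctuation and is discarded, while the paragraph is kept. A fixed count threshold is the simplest possible rule; a density measure (symbols per character) or a tree-aware rule would be natural refinements, but the abstract gives no detail on which the authors used.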