An Extraction Algorithm of Chinese HTML Content Based on Similarity

HTML content extraction is important to web mining. This paper proposes a new method for extracting web page content that combines the content similarity and tag similarity of text lines. The approach avoids web page blocking, a step that traditional methods perform when processing web pages. It first extracts the largest text line and computes the text similarity and tag similarity between lines, then uses both similarities to extract the web page content. The approach was tested on a collection of web pages. In the experiments, its accuracy was close to 95%, which shows that the method is effective in practice.
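
The line-based idea can be illustrated with a minimal sketch. The abstract does not define the similarity measures, how they are combined, or any threshold, so everything below is an assumption made for illustration: Jaccard similarity over tag names and over character sets stands in for the tag and text similarities, the two are combined as a weighted sum, and the names `split_line`, `jaccard`, `extract_content`, `threshold`, and `tag_weight` are hypothetical.

```python
import re

# Matches an opening or closing HTML tag and captures its name.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def split_line(line):
    """Return (set of tag names, visible text) for one line of HTML source."""
    tags = {name.lower() for name in TAG_RE.findall(line)}
    text = TAG_RE.sub("", line).strip()
    return tags, text


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_content(html, threshold=0.4, tag_weight=0.5):
    """Keep the lines that resemble the longest text line (the seed).

    Assumption: Jaccard over tag names and over characters stands in for
    the tag and text similarities, which the abstract does not specify.
    """
    lines = [split_line(raw) for raw in html.splitlines()]
    lines = [(tags, text) for tags, text in lines if text]
    if not lines:
        return []
    # The line with the most visible text serves as the seed.
    seed_tags, seed_text = max(lines, key=lambda item: len(item[1]))
    kept = []
    for tags, text in lines:
        tag_sim = jaccard(tags, seed_tags)
        text_sim = jaccard(set(text), set(seed_text))
        score = tag_weight * tag_sim + (1 - tag_weight) * text_sim
        if score >= threshold:
            kept.append(text)
    return kept


sample = """<html><body>
<ul><li><a href="/">首页</a></li><li><a href="/news">新闻</a></li></ul>
<p>网页正文抽取是网络挖掘的重要步骤,本文提出一种基于相似度的抽取方法。</p>
<p>该方法结合文本相似度和标签相似度来抽取网页正文。</p>
<div><a href="/about">关于我们</a></div>
</body></html>"""

result = extract_content(sample)
for text in result:
    print(text)
```

In this sample the navigation and footer lines share no tags with the seed paragraph, so they score below the assumed threshold and only the two paragraph lines are kept. Real pages rarely put one block per source line, so a practical version would need a more careful notion of a line than `str.splitlines`.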