Content extraction technique for web pages based on HTML-tags

This paper proposes an HTML element deletion method for automatically extracting the content of a web page. The method is based on a region sub-block technique; it is derived by analyzing the characteristics of noise data and their impact on page content, and by exploiting the structural characteristics of HTML tags. Experiments show that the new method effectively extracts the main content of a web page in most cases. The proposed tag analysis method for HTML documents can be used not only to extract the text of an HTML file, but also to obtain the contents of other HTML elements.
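The general idea of deleting noise elements and then keeping text-bearing sub-blocks can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details the abstract does not give. The sets `NOISE_TAGS` and `BLOCK_TAGS`, the class `BlockExtractor`, the function `extract_main_text`, and the `min_chars` threshold are all hypothetical choices made for illustration only.

```python
from html.parser import HTMLParser

# Assumption: elements treated as noise and deleted together with their content.
NOISE_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside", "form", "iframe"}
# Assumption: elements whose boundaries delimit a region sub-block.
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "h1", "h2", "h3"}


class BlockExtractor(HTMLParser):
    """Collects the text of each sub-block, skipping everything inside noise elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0   # > 0 while inside at least one noise element
        self.blocks = []       # finished sub-blocks, in document order
        self._buf = []         # text fragments of the sub-block being built

    def flush(self):
        # Close the current sub-block: normalize whitespace, keep it if non-empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in BLOCK_TAGS and self.noise_depth == 0:
            self.flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            if self.noise_depth > 0:
                self.noise_depth -= 1
        elif tag in BLOCK_TAGS and self.noise_depth == 0:
            self.flush()

    def handle_data(self, data):
        # Text inside a deleted (noise) element is dropped.
        if self.noise_depth == 0:
            self._buf.append(data)


def extract_main_text(html, min_chars=20):
    """Return sub-blocks with at least min_chars characters (an assumed heuristic)."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return [block for block in parser.blocks if len(block) >= min_chars]
```

Calling `extract_main_text(page_source)` on a page's HTML returns a list of text blocks in document order. The length threshold is a crude stand-in for whatever noise criterion the paper actually uses, and the same tag-driven traversal could be adapted to collect other element contents, such as link targets, instead of text.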