Content Based Web Sampling

Web characterization methods have been studied for many years. Most of these methods focus on text-based web content. Some of them analyze the content of a web page through its HTML code, hyperlinks, and/or DOM structure. A web page is seldom characterized by its visual appearance. A good reason to also consider visual appearance is that humans initially perceive a web page as an image, and only then look in detail at the text and further pictorial content. Visual analysis is therefore a more natural way to analyze and classify the content of web pages. Moreover, as more and more web technologies have appeared in recent years (JavaScript, Flash, and AJAX), analyzing the HTML code of a web page seems to be meaningless without actually parsing and interpreting it. This poses new challenges for textual web page characterization and affects the efficiency of indexing techniques. Thus, by combining established text classification methods with our novel visual content-based methods, we offer a more promising way to characterize the web. The main idea of the project is to take a snapshot of each page and use image classification methods to categorize the pages.