Domain Identification and Classification of Web Pages Using Artificial Neural Network

A huge amount of data has recently become available on the WWW [3], most of which remains inaccessible to conventional Web crawlers because the pages are generated dynamically in response to user queries submitted through Web-based search form interfaces [5, 6, 9]. A Hidden Web crawler must be able to annotate such Hidden Web data automatically. This goal can be accomplished only if the crawler is provided with knowledge or data pertaining to a domain similar to that of the search form interface. This paper offers a solution by exploiting the information present in the HTML structure of Web pages to efficiently obtain domain-specific data, which facilitates the crawler’s access to dynamic Web pages through automatic processing of these search form interfaces. Identifying the domain of a Web page also makes the Web content easier to organize and understand.
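As a rough illustration of what exploiting the HTML structure of a search form might look like, the following Python sketch collects label text and field names from the form elements of a page. It is not the method of this paper: the class name `FormTermExtractor`, the function `extract_form_terms`, and the sample page are hypothetical, and treating labels and field names as domain hints is an assumption made for the example.

```python
from html.parser import HTMLParser


class FormTermExtractor(HTMLParser):
    """Hypothetical helper: collect label text and field names inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.in_form = False   # True while the parser is inside a <form>
        self.in_label = False  # True while the parser is inside a <label>
        self.terms = []        # candidate domain terms, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag == "label":
            self.in_label = True
        elif self.in_form and tag in ("input", "select", "textarea"):
            # Assumption: the "name" attribute of a field hints at its meaning.
            name = dict(attrs).get("name")
            if name:
                self.terms.append(name.lower())

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
            self.in_label = False
        elif tag == "label":
            self.in_label = False

    def handle_data(self, data):
        # Keep only visible text that appears inside a form label.
        text = data.strip()
        if self.in_label and text:
            self.terms.append(text.lower())


def extract_form_terms(html):
    """Return the label texts and field names found in the forms of `html`."""
    parser = FormTermExtractor()
    parser.feed(html)
    parser.close()
    return parser.terms


# Made-up search form, used only to exercise the sketch.
SAMPLE_PAGE = """
<html><body>
<form action="/search">
  <label for="t">Title</label> <input type="text" name="q_title" id="t">
  <label for="a">Author</label> <input type="text" name="q_author" id="a">
  <input type="submit" value="Search">
</form>
</body></html>
"""

print(extract_form_terms(SAMPLE_PAGE))
```

For the sample page the collected terms include "title" and "author", which would hint at a book-related domain. How such terms are mapped to a domain is outside the scope of this sketch.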