A Framework for Automatic Classification of e-Business Web Content

Classifying e-Business Web content in search results is presently done manually by people who visit a website and decide under which topic headings the page or site belongs. This is a tedious and expensive process carried out by many portal websites. As a result, it has become almost impossible for a casual user to look for specific e-Business information without getting lost among huge amounts of mixed data. In particular, retrieval failures arise because of the ambiguity of natural language. Queries are at best imperfect representations of the user's information needs, and the words chosen by an author are imperfect representations of the information contained in the document. It should come as no surprise that matching query words to document words, which is at the heart of any information retrieval system, yields a very imperfect result. Moreover, the distributed nature of the WWW adds new problems to old ones: documents may be duplicated many times at many different sites; Web pages are added at alarming rates, creating an extremely dynamic information environment; the quality of information contained in Web pages varies greatly; and Web pages are deleted or moved frequently, leaving behind dangling references. This paper proposes a framework and a system implementation for automatic e-Business Web content classification in search results, which tries to fill the gaps mentioned above by drawing on current research techniques, including the study of human retrieval behavior, and on information embedded in the HTML code itself. To evaluate the system, we considered two test sets, one offline and one online. For offline testing, we used seven e-Business Web collection groups from the CMU World Wide Knowledge Base, with 1250 Web pages for training and 2000 Web pages for testing. For online testing, we used a collection of Web pages drawn from search engine results. Both sets of results show that the average system performance is about 85%.