A Hybrid Neural Network for Web Page Classification

Web page classification is one of the essential techniques for Web mining. This paper proposes a framework for Web page classification: a hybrid architecture that uses principal component analysis (PCA) for feature selection and a self-organizing feature map (SOFM), combined with some conventional statistical methods. The proposed hybrid architecture consists of four modules. The page-preprocessing module extracts the textual features of a document and is divided into stopping and stemming. Stemming is the process of reducing each word extracted from a document to a possible root word. Stopping is the process of deleting high-frequency words with low content-discriminating power from a document, such as ‘to’, ‘a’, ‘and’, ‘it’, etc.
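The stopping and stemming steps of the page-preprocessing module can be illustrated with a minimal Python sketch. The stop list, the suffix rules, and the function names below are assumptions made for illustration only; the source does not specify which stop list or stemming algorithm is used, and the crude suffix stripper here stands in for whatever stemmer the authors applied.

```python
import re

# Hypothetical stop list: the text names only 'to', 'a', 'and', 'it' as examples;
# the remaining entries are assumptions added for illustration.
STOP_WORDS = {"to", "a", "and", "it", "the", "of", "is", "in"}

# Illustrative suffix rules, not the stemmer used in the paper.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Stopping: drop high-frequency words with little discriminating power."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(word):
    """Stemming: strip the first matching suffix, keeping at least 3 characters."""
    if word.endswith("ss"):
        return word  # avoid turning e.g. 'class' into 'clas'
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply tokenization, stopping, then stemming to one document."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("It is linked to a page and the pages"))
```

Under these assumed rules, the sample sentence reduces to `['link', 'page', 'page']`: the stop words are removed first, and the surviving words are then mapped to a shared root so that variants such as "page" and "pages" count as the same feature.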