Text Learning and Hierarchical Feature Selection in Webpage Classification

One of the solutions of retrieving information from the Internet is by classifying web pages automatically. In almost all classification methods that have been published, feature selection is a very important issue. Although there are many feature selection methods has been proposed. Most of them focus on the features within a category and ignore that the hierarchy of categories also plays an important role in achieving accurate classification results. This paper proposes a new feature selection method that incorporates hierarchical information, which prevents the classifying process from going through every node in the hierarchy. Our test results show that our classification algorithm using hierarchical information reduces the search complexity from n to log(n) and increases the accuracy by 6.2% comparing to a related algorithm.