Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration

Most Web page classification models represent the feature space with the bag-of-words (BOW) model. The original BOW representation, however, cannot capture semantic relationships between terms. One possible solution is a topic model based on the Latent Dirichlet Allocation (LDA) algorithm, which clusters the term features into a set of latent topics so that terms assigned to the same topic are semantically related. In this paper, we propose a novel hierarchical classification method that uses a topic model and integrates additional term features from neighboring pages. The method consists of two phases: (1) feature representation using a topic model and neighboring-page integration, and (2) a hierarchical Support Vector Machine (SVM) classification model constructed from a confusion matrix. In our experiments, the proposed hierarchical SVM model, which integrates the current page with its neighboring pages via the topic model, yielded the best performance, with an accuracy of 90.33% and an F1 measure of 90.14%, improvements of 5.12% and 5.13%, respectively, over the original SVM model.
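
As an illustration of phase (2), the sketch below shows one way a class hierarchy could be derived from a confusion matrix: classes that a flat classifier frequently mistakes for each other are grouped, so that each group can later receive its own second-level classifier. The abstract does not specify the grouping rule, so the symmetric confusion rate, the `threshold` parameter, the function name `group_confused_classes`, and the category labels and counts are all assumptions made for this example, not the authors' procedure.

```python
from itertools import combinations


def group_confused_classes(labels, confusion, threshold):
    """Group classes whose mutual confusion rate is at least `threshold`.

    confusion[i][j] is the number of pages of true class i predicted as
    class j by a flat classifier. The grouping rule is an assumption made
    for illustration; the source does not state one.
    """
    n = len(labels)
    row_totals = [sum(row) for row in confusion]
    parent = list(range(n))  # union-find forest over class indices

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        # Symmetric confusion rate: share of pages from classes i and j
        # that were assigned to the other class of the pair.
        swapped = confusion[i][j] + confusion[j][i]
        total = row_totals[i] + row_totals[j]
        if total and swapped / total >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(labels[i])
    return sorted(sorted(group) for group in groups.values())


# Hypothetical labels and counts, invented for this example only.
labels = ["arts", "business", "science", "sports"]
confusion = [
    [80, 5, 3, 12],   # true class: arts
    [4, 85, 10, 1],   # true class: business
    [2, 12, 84, 2],   # true class: science
    [14, 1, 1, 84],   # true class: sports
]
print(group_confused_classes(labels, confusion, threshold=0.1))
```

With these invented counts, "arts" and "sports" share a confusion rate of 0.13 and "business" and "science" share 0.11, so a threshold of 0.1 yields two groups. A top-level classifier would then choose between the groups, and a separate classifier inside each group would separate its member classes.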
