Multilayer classification of web pages using random forest and semi-supervised latent Dirichlet allocation

Classifying the content of web pages is essential to many information retrieval tasks. In this paper, we propose a new methodology for multilayer soft classification. Our approach couples semi-supervised Latent Dirichlet Allocation (LDA) with the Random Forest classifier. We use LDA to compute the distribution of topics in each document and then use these distributions as features to train the Random Forest classifier. The trained classifier can then assign each web document to categories at different layers of the category hierarchy. We applied our methodology to a data set collected from dmoz and obtained satisfactory results.
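The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, and it substitutes scikit-learn's standard unsupervised LDA for the semi-supervised variant used in the paper. The toy corpus, the category paths, the number of topics, and the choice of training one forest per hierarchy layer are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; each document has a (layer-1, layer-2) category path.
docs = [
    "quantum particle energy physics experiment",
    "particle collider energy physics detector",
    "cell gene protein biology organism",
    "gene dna protein biology evolution",
    "football goal league match team",
    "football team striker goal cup",
    "tennis racket serve match court",
    "tennis court serve tournament set",
]
paths = [
    ("Science", "Physics"), ("Science", "Physics"),
    ("Science", "Biology"), ("Science", "Biology"),
    ("Sports", "Football"), ("Sports", "Football"),
    ("Sports", "Tennis"), ("Sports", "Tennis"),
]

# Step 1: bag-of-words counts, then a topic distribution per document.
# Note: this is plain unsupervised LDA, standing in for semi-supervised LDA.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
theta = lda.fit_transform(counts)  # shape (n_docs, n_topics); rows sum to 1

# Step 2: one Random Forest per hierarchy layer (an assumed design),
# each trained on the topic distributions rather than on raw word counts.
layer_labels = [
    [p[0] for p in paths],         # layer 1: top-level category
    ["/".join(p) for p in paths],  # layer 2: full category path
]
forests = [
    RandomForestClassifier(n_estimators=50, random_state=0).fit(theta, labels)
    for labels in layer_labels
]

def classify(text):
    """Return one dict per layer mapping category -> predicted probability."""
    topic_mix = lda.transform(vectorizer.transform([text]))
    return [
        dict(zip(forest.classes_, forest.predict_proba(topic_mix)[0]))
        for forest in forests
    ]

result = classify("physics energy particle")
for layer, scores in enumerate(result, start=1):
    print(f"layer {layer}:", scores)
```

Returning class probabilities from `predict_proba` instead of a single label is what makes the classification "soft" in this sketch: a document can carry weight in several categories at each layer.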
