Paper information - Classification of HTML documents by Hidden Tree-Markov Models

Content-based search and organization of Web documents pose new issues in information retrieval. We propose a novel approach for the classification of HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). The classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report some promising experimental results showing that the use of the structured representation improves the classification accuracy in most cases.
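The structured representation mentioned in the abstract can be illustrated with a minimal sketch that turns an HTML page into a tree of logical contexts, each holding the words that appear directly inside it. This is only an illustration under stated assumptions: the set of tags treated as contexts (`CONTEXT_TAGS`), the `ContextNode` class, and the `build_context_tree` helper are hypothetical names chosen here, not taken from the paper, and the sketch covers only the tree-building step, not the HTMM itself.

```python
from html.parser import HTMLParser

# Tags treated as logical contexts in this sketch.
# This set is an assumption for illustration, not the paper's definition.
CONTEXT_TAGS = {"html", "body", "title", "h1", "h2", "h3", "p", "a", "li"}


class ContextNode:
    """One logical context: a label, its own words, and child contexts."""

    def __init__(self, label):
        self.label = label
        self.words = []
        self.children = []


class ContextTreeBuilder(HTMLParser):
    """Builds a tree of ContextNode objects while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.root = ContextNode("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Open a new context only for tags in CONTEXT_TAGS; others are ignored.
        if tag in CONTEXT_TAGS:
            node = ContextNode(tag)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open context with this label, if there is one.
        # Index 0 (the synthetic root) is never popped.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach lowercased words to the innermost open context.
        self.stack[-1].words.extend(data.lower().split())


def build_context_tree(html):
    """Parse an HTML string and return the root of its context tree."""
    builder = ContextTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `build_context_tree` on a page returns a root node whose descendants mirror the nesting of headings, paragraphs, and anchors. A tree-structured model could then treat each node's `words` as the observation attached to that node, in contrast to a flat bag-of-words over the whole page.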

Franco Scarselli | Marco Gori | Marco Maggini | Michelangelo Diligenti
