Construction of a hierarchical classifier schema using a combination of text-based and image-based approaches

Web document hierarchical classification approaches often rely on textual features alone even though web pages include multimedia data. We propose a new hierarchical integrated web classification approach that combines image-based and text-based approaches. Instead of using a flat classifier to combine text and image classification, we perform classification on a hierarchy differently on different levels of the tree, using text for branches and images only at leaves. The results of our experiments show that the use of the hierarchical structure improved web document classification performance significantly.