Visual features in genre classification of HTML

Automatic genre classification has historically focused on extracting textual features from documents. In this research, we investigate whether visual features of HTML documents can improve the classification of fine-grained genres. Three feature sets were compared on a genre classification task in the e-commerce domain: one with only textual features, one with HTML features added, and a third with visual features added as well. Our experiments show that adding HTML and visual features yields much better classification than textual features alone.
