Importance of HTML Structural Elements and Metadata in Automated Subject Classification

The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several methods: (total and partial) precision and recall, semantic distance, and multiple regression. The results showed that the best performance requires including all of the elements in the classification process. The exact way of combining the significance indicators turned out not to be overly important: measured by F1, the best combination of significance indicators performed no more than 3% better than the baseline.
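To make the idea of element-level significance indicators concrete, the following is a minimal sketch, not the study's actual algorithm. It assumes that each Web page element contributes a count of term matches per class, that each element has a single numeric weight, and that a page's class score is the weighted sum of those counts. The weight values, the function names `score_classes` and `f1`, and the class identifiers in the usage note are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical significance indicators per page element.
# These values are illustrative only and are not taken from the study.
WEIGHTS = {"metadata": 2.0, "title": 3.0, "headings": 2.0, "text": 1.0}


def score_classes(matches, weights=WEIGHTS):
    """Combine per-element match counts into one weighted score per class.

    matches: {element_name: {class_id: match_count}}
    returns: {class_id: weighted_score}
    """
    scores = defaultdict(float)
    for element, per_class in matches.items():
        # Elements without a configured weight contribute nothing.
        weight = weights.get(element, 0.0)
        for class_id, count in per_class.items():
            scores[class_id] += weight * count
    return dict(scores)


def f1(assigned, correct):
    """F1 measure for one page: harmonic mean of precision and recall."""
    assigned, correct = set(assigned), set(correct)
    true_positives = len(assigned & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(assigned)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)
```

For example, calling `score_classes({"title": {"A": 1}, "text": {"A": 2, "B": 3}})` weights the single title match for the hypothetical class `A` more heavily than each main-text match. Comparing the classes assigned under different weight settings against the manually assigned classes with `f1` is one way to express how much a given combination of weights differs from a baseline.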
