Machine Learning-Based Web Documents Categorization by Semantic Graphs

This work addresses web page categorization by means of semantic graphs and machine learning techniques. To take semantic information into account, we propose using a semantic graph, which provides a compact and structured representation of the concepts present in a document. The semantic graph yields a map of the semantic areas contained in the document and of their relationships with respect to a particular concept or term. The semantic measure between terms is computed using the WordNet lexical database. Document categorization is then performed by a machine learning technique, and we compare the performance of a supervised and an unsupervised technique (Support Vector Machine and Self-Organizing Maps, respectively). The proposed methodology has been applied to the classification and clustering of both benchmark and real data. The results show that the model trained with semantic features performs satisfactorily, in particular when the unsupervised machine learning technique is used.
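
As an illustration of the kind of pipeline described above, the following is a minimal sketch and not the authors' implementation. It assumes a small hand-written hypernym table (`HYPERNYM`) as a stand-in for WordNet, a path-based similarity of the form 1 / (1 + path length), and a hypothetical feature definition (mean similarity of the document's terms to a set of reference concepts). All function names, the threshold, and the taxonomy are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent), standing in for WordNet hypernym links.
HYPERNYM = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "animal": "entity",
    "car": "vehicle",
    "bus": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
}


def ancestors(term):
    """Return the chain from the term up to the root, term first."""
    chain = [term]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_similarity(a, b):
    """1 / (1 + shortest path length) through the lowest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    depth_b = {node: i for i, node in enumerate(chain_b)}
    for i, node in enumerate(chain_a):
        if node in depth_b:
            return 1.0 / (1 + i + depth_b[node])
    return 0.0  # no common ancestor in the toy taxonomy


def semantic_graph(terms, threshold=0.2):
    """Weighted graph: an edge links each term pair whose similarity reaches the threshold."""
    graph = {}
    for a, b in combinations(sorted(set(terms)), 2):
        sim = path_similarity(a, b)
        if sim >= threshold:
            graph[(a, b)] = sim
    return graph


def feature_vector(terms, concepts):
    """Assumed feature: mean similarity of the document's terms to each reference concept."""
    if not terms:
        return [0.0] * len(concepts)
    return [sum(path_similarity(t, c) for t in terms) / len(terms) for c in concepts]


doc_terms = ["dog", "cat", "car"]
graph = semantic_graph(doc_terms)
features = feature_vector(["dog", "cat"], ["animal", "vehicle"])
```

Feature vectors of this kind could then be passed to a supervised classifier or to a self-organizing map. With a real lexical database, the toy table would be replaced by WordNet lookups, and word-sense ambiguity would have to be handled explicitly.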