This work addresses web page categorization by combining semantic graphs with machine learning techniques. We propose a semantic graph that provides a compact, structured representation of the concepts present in a document, so that semantic information is taken into account. The graph maps the semantic areas contained in the document and their relationships with respect to a particular concept or term. The semantic similarity between terms is computed using a lexical database (WordNet). Document categorization is then performed by a machine learning technique, and we compare the performance of a supervised and an unsupervised technique (Support Vector Machine and Self-Organizing Map, respectively). The proposed methodology has been applied to the classification and clustering of benchmark and real data. The results indicate that a model trained with semantic features obtains satisfactory results, in particular when the unsupervised technique is used.
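As an illustration of the pipeline described above, the following minimal sketch builds a small semantic graph and derives semantic features from it. It rests on several assumptions that are not taken from this work: a hypothetical toy taxonomy (`HYPERNYM`) stands in for WordNet, a simple path-based similarity stands in for the semantic measure actually used, and the function names and the `threshold` value are illustrative only.

```python
from itertools import combinations

# Hypothetical toy taxonomy standing in for WordNet (assumption):
# each term maps to its parent (hypernym).
HYPERNYM = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal", "animal": "entity",
    "car": "vehicle", "bus": "vehicle", "vehicle": "entity",
}


def path_to_root(term):
    """Return the chain of hypernyms from a term up to the taxonomy root."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def path_similarity(a, b):
    """Path-based similarity (assumed measure): 1 / (1 + shortest path length)."""
    depth_b = {node: i for i, node in enumerate(path_to_root(b))}
    for i, node in enumerate(path_to_root(a)):
        if node in depth_b:  # first common ancestor
            return 1.0 / (1 + i + depth_b[node])
    return 0.0  # no common ancestor


def semantic_graph(terms, threshold=0.3):
    """Connect pairs of document terms whose similarity reaches the threshold."""
    edges = {}
    for a, b in combinations(sorted(set(terms)), 2):
        sim = path_similarity(a, b)
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges


def concept_features(terms, concepts):
    """One feature per reference concept: mean similarity of the terms to it."""
    return [
        sum(path_similarity(t, c) for t in terms) / len(terms)
        for c in concepts
    ]


if __name__ == "__main__":
    doc = ["dog", "cat", "car", "bus"]
    print(semantic_graph(doc))
    print(concept_features(["car", "bus"], ["animal", "vehicle"]))
```

Feature vectors of this kind could then be passed to a supervised or an unsupervised learner, such as the Support Vector Machine or Self-Organizing Map mentioned above; that step is omitted here to keep the sketch free of external dependencies.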