Media companies today face a serious problem in cataloging news because of the large number of articles their documentation departments receive. Manual cataloging is prone to errors and omissions, since staff members differ in point of view and level of expertise. The large number of terms in a thesaurus adds further difficulty. In this paper, we present a new method for text categorization over a corpus of newspaper articles in which annotations must be composed of thesaurus elements. The method applies lemmatization, extracts keywords and named entities, and then uses a combination of Support Vector Machines (SVM), ontologies, and heuristics to infer appropriate tags for the annotation. We carried out a detailed evaluation of our method on real newspaper articles and compared our tagging with the annotation performed by a real documentation department, obtaining promising results.
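To make the pipeline described above more concrete, the following is a minimal sketch of how its stages could fit together. It is not the implementation evaluated in the paper. Everything in it is an assumption made for illustration: the toy lemma table, stopword list, gazetteer, and broader-term map stand in for a real lemmatizer, named-entity recognizer, and thesaurus; the function names are hypothetical; and the SVM is a simple linear one trained by sub-gradient descent on the hinge loss, with one binary classifier per thesaurus term.

```python
import re
from collections import Counter

# Toy resources, all hypothetical stand-ins for a real lemmatizer and thesaurus.
LEMMAS = {"elections": "election", "votes": "vote", "voted": "vote",
          "goals": "goal", "scored": "score", "matches": "match"}
STOPWORDS = {"the", "in", "on", "of", "and", "a"}
BROADER = {"elections": "politics", "football": "sport"}  # ontology: narrower -> broader term
ENTITY_TAGS = {"Madrid": "madrid"}                        # gazetteer: entity -> thesaurus term


def features(text):
    """Lemmatize, drop stopwords, and count the remaining keywords."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return Counter(l for l in lemmas if l not in STOPWORDS)


def named_entities(text):
    """Naive gazetteer lookup on capitalized tokens (assumed heuristic)."""
    return [t for t in re.findall(r"[A-Z][a-z]+", text) if t in ENTITY_TAGS]


def train_linear_svm(samples, labels, epochs=20, eta=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (a linear SVM)."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            for f in w:                  # regularization shrinks every weight
                w[f] *= 1.0 - eta * lam
            if margin < 1:               # hinge loss is active: move toward the sample
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w


def train_taggers(corpus):
    """Train one binary SVM per thesaurus term (one-vs-rest)."""
    samples = [features(text) for text, _ in corpus]
    terms = sorted({t for _, tags in corpus for t in tags})
    return {term: train_linear_svm(samples, [1 if term in tags else -1 for _, tags in corpus])
            for term in terms}


def annotate(text, taggers):
    """Combine SVM decisions, broader ontology terms, and entity heuristics."""
    x = features(text)
    tags = {term for term, w in taggers.items()
            if sum(w.get(f, 0.0) * v for f, v in x.items()) > 0}
    tags |= {BROADER[t] for t in tags if t in BROADER}      # ontology step
    tags |= {ENTITY_TAGS[e] for e in named_entities(text)}  # heuristic step
    return sorted(tags)


# Tiny invented training corpus: (article text, thesaurus tags).
CORPUS = [
    ("Parties campaign before the elections", {"elections"}),
    ("Voters cast votes in the elections", {"elections"}),
    ("The striker scored two goals", {"football"}),
    ("Fans watched the football matches", {"football"}),
]
taggers = train_taggers(CORPUS)
print(annotate("Madrid voted in the elections", taggers))
```

On this toy corpus the final call returns `['elections', 'madrid', 'politics']`: the SVM proposes `elections`, the broader-term map adds `politics`, and the gazetteer contributes `madrid`. Keeping the three sources separate makes it easy to see which stage produced each tag, which is useful when comparing automatic tags against a human annotation.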