NASS: News Annotation Semantic System

Media companies face a serious problem in cataloging news because of the large number of articles received by their documentation departments. Manual cataloging is prone to errors and omissions because staff members differ in point of view and level of expertise. The large number of terms in a thesaurus adds further difficulty. In this paper, we present a new method for text categorization over a corpus of newspaper articles in which each annotation must be composed of thesaurus elements. The method applies lemmatization, extracts keywords and named entities, and finally uses a combination of Support Vector Machines (SVM), ontologies, and heuristics to infer appropriate tags for the annotation. We carried out a detailed evaluation of our method on real newspaper articles and compared our tagging with the annotation performed by a real documentation department, obtaining promising results.
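
The stages named above (lemmatization, keyword and named-entity extraction, tag inference restricted to thesaurus elements) can be illustrated with a minimal Python sketch. This is not the NASS implementation: the lemma table, stopword list, thesaurus, and function names are all hypothetical, the named-entity step is a crude capitalization heuristic, and the SVM, ontology, and heuristic combination is replaced here by a simple keyword-overlap score so that the example stays self-contained.

```python
import re
from collections import Counter

# Hypothetical toy resources; a real system would use a full lemmatizer
# and a complete thesaurus maintained by the documentation department.
LEMMAS = {"elections": "election", "voted": "vote", "votes": "vote", "parties": "party"}
STOPWORDS = {"the", "in", "a", "of", "and", "for", "on", "to"}
THESAURUS = {
    "politics": {"election", "vote", "party", "parliament"},
    "sports": {"match", "goal", "team", "league"},
}

def lemmatize(text):
    """Lowercase each word and map it to its lemma when the table knows it."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def keywords(lemmas, top_n=10):
    """Return the most frequent non-stopword lemmas."""
    counts = Counter(l for l in lemmas if l not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

def named_entities(text):
    """Crude stand-in for NER: capitalized words that follow a lowercase word or comma."""
    return sorted(set(re.findall(r"(?<=[a-z,] )[A-Z][a-z]+", text)))

def suggest_tags(text, min_overlap=2):
    """Stand-in for the SVM/ontology/heuristic step: keep thesaurus tags
    whose associated lemmas overlap enough with the article keywords."""
    kws = set(keywords(lemmatize(text)))
    scores = {tag: len(kws & terms) for tag, terms in THESAURUS.items()}
    return sorted(tag for tag, score in scores.items() if score >= min_overlap)

article = ("The elections in Spain were close. "
           "Many citizens voted and the parties counted votes.")
print(suggest_tags(article))
print(named_entities(article))
```

Because only keys of the thesaurus dictionary can be returned, every suggested tag is a thesaurus element by construction, which mirrors the constraint stated in the abstract. A trained classifier would replace the overlap score in any realistic setting.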