Ontology-Based Text Classification into Dynamically Defined Topics

We present a method for the automatic classification of text documents into a dynamically defined set of topics of interest. The proposed approach requires only a domain ontology and a set of user-defined classification topics, specified as contexts in the ontology. Our method measures the semantic similarity between the thematic graph created from a text document and the ontology sub-graphs obtained by projecting the defined contexts. The domain ontology effectively becomes the classifier, with classification topics expressed as ontological contexts. In contrast to traditional supervised categorization methods, the proposed method does not require a training set of documents. More importantly, our approach allows the classification topics to be changed dynamically without retraining the classifier. In our experiments, we used the English-language Wikipedia, converted to an RDF ontology, to categorize a corpus of current Web news documents into a selection of topics of interest. The high accuracy achieved in our tests demonstrates the effectiveness of the proposed method, as well as the applicability of Wikipedia for semantic text categorization purposes.
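The idea of using ontology contexts as classification topics can be illustrated with a small sketch. This is not the method evaluated in the paper: the toy triples, the `inCategory` predicate, the substring-based entity spotting, and the coverage score below are all assumptions chosen for brevity, standing in for the thematic graph construction and the semantic similarity measure described above.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy ontology; a real one would be an RDF graph (e.g. from Wikipedia).
ONTOLOGY: Set[Triple] = {
    ("Lionel_Messi", "memberOf", "FC_Barcelona"),
    ("Lionel_Messi", "inCategory", "Football"),
    ("FC_Barcelona", "inCategory", "Football"),
    ("Federal_Reserve", "sets", "Interest_rate"),
    ("Federal_Reserve", "inCategory", "Economics"),
    ("Interest_rate", "inCategory", "Economics"),
}


def project_context(ontology: Set[Triple], categories: Set[str]) -> Set[str]:
    """Entities selected by a context: subjects linked to any of its categories."""
    return {s for s, p, o in ontology if p == "inCategory" and o in categories}


def spot_entities(text: str, ontology: Set[Triple]) -> Set[str]:
    """Naive stand-in for thematic graph construction: match entity labels in the text."""
    lowered = text.lower()
    subjects = {s for s, _, _ in ontology}
    return {e for e in subjects if e.replace("_", " ").lower() in lowered}


def classify(text: str, ontology: Set[Triple],
             contexts: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each topic by the fraction of document entities its context covers."""
    doc = spot_entities(text, ontology)
    scores: Dict[str, float] = {}
    for topic, categories in contexts.items():
        ctx = project_context(ontology, categories)
        scores[topic] = len(doc & ctx) / len(doc) if doc else 0.0
    return scores


# Topics are plain data: changing them requires no retraining step.
contexts = {"sports": {"Football"}, "finance": {"Economics"}}
print(classify("Lionel Messi scored twice for FC Barcelona.", ONTOLOGY, contexts))
```

Because the topics are only named sets of ontology categories, adding or redefining a topic amounts to editing the `contexts` dictionary and calling `classify` again, which mirrors the training-free property claimed in the abstract.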
