Exploration of Document Classification with Linked Data and PageRank

In this article, we present a new approach to document classification that uses Linked Data and PageRank. Our research focuses on classification methods enhanced by semantic information, which can be obtained from an ontology or from Linked Data; in our case, DBpedia served as the source of Linked Data. Because the feature selection method is semantically based, the resulting features are human-readable and can be understood by non-expert users. PageRank is used during the feature selection and generation phase to expand basic features into more general representatives. Both feature selection and PageRank processing are therefore based on network relations obtained from Linked Data. The discovered features can be used by standard classification algorithms. We present promising results on two different datasets that show the approach is straightforward to apply.
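
To illustrate the idea of expanding basic features into more general representatives, the following is a minimal sketch and not the implementation used in this work. It assumes a toy directed graph whose nodes stand in for DBpedia resources; the resource names, the `pagerank` and `expand_features` helpers, the damping factor of 0.85, and the one-hop expansion rule are all hypothetical choices made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of node -> list of targets."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def expand_features(base_features, graph, rank, top_k=1):
    """Return the top_k highest-ranked direct neighbours of the base features."""
    candidates = set()
    for feature in base_features:
        candidates.update(graph.get(feature, []))
    candidates -= set(base_features)
    ordered = sorted(candidates, key=lambda node: (-rank[node], node))
    return ordered[:top_k]


# Hypothetical toy graph: edges point from a resource to a more general one.
toy_graph = {
    "Football": ["Team_sport", "Sport"],
    "Tennis": ["Racquet_sport", "Sport"],
    "Team_sport": ["Sport"],
    "Racquet_sport": ["Sport"],
}

ranks = pagerank(toy_graph)
general_features = expand_features(["Football", "Tennis"], toy_graph, ranks, top_k=1)
print(general_features)
```

In this sketch, the resource shared by both base features collects the most rank and is therefore chosen as the more general representative, which could then be passed to a standard classifier as an additional feature.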
