Paper information - Mining Wikipedia Knowledge to improve document indexing and classification

Weblogs are an important source of information that requires automatic techniques to categorize them into “topic-based” content, to facilitate their future browsing and retrieval. In this paper we propose and illustrate the effectiveness of a new tf.idf measure. The proposed Conf.idf and Catf.idf measures are based solely on the terms-to-concepts-to-categories (TCONCAT) mapping method, which utilizes Wikipedia. Wikipedia, the knowledge base, is a large-scale Web encyclopaedia with a huge number of high-quality articles and categorical indexes. Using this knowledge base, our proposed framework solves the weblog classification problem in two stages. The first stage finds the terms that belong to a unique concept (article) and disambiguates the terms that belong to more than one concept. The second stage determines the categories to which these concepts belong. Experimental results confirm that the proposed system can efficiently distinguish weblogs that belong to more than one category, and that it performs better than traditional statistical Natural Language Processing (NLP) approaches.
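
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not give the Conf.idf or Catf.idf formulas, so the weighting below simply assumes they follow the usual tf.idf pattern with concept or category counts in place of term counts. The lookup tables, the function names, and the context-overlap disambiguation heuristic are all hypothetical stand-ins, not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical toy tables; a real system would derive these from Wikipedia
# article titles, redirects, and category links.
TERM_TO_CONCEPTS = {
    "python": ["Python (programming language)", "Pythonidae"],  # ambiguous term
    "compiler": ["Compiler"],
    "snake": ["Snake"],
    "venom": ["Venom"],
}
CONCEPT_TO_CATEGORIES = {
    "Python (programming language)": ["Programming languages"],
    "Compiler": ["Programming languages"],
    "Pythonidae": ["Reptiles"],
    "Snake": ["Reptiles"],
    "Venom": ["Toxicology"],
}


def map_terms_to_concepts(terms):
    """Stage 1: resolve unambiguous terms, then disambiguate the rest."""
    resolved, ambiguous = [], []
    context = Counter()  # categories seen via unambiguous concepts
    for term in terms:
        candidates = TERM_TO_CONCEPTS.get(term, [])
        if len(candidates) == 1:
            resolved.append(candidates[0])
            context.update(CONCEPT_TO_CATEGORIES[candidates[0]])
        elif len(candidates) > 1:
            ambiguous.append(candidates)
    for candidates in ambiguous:
        # Assumed heuristic: pick the candidate whose categories overlap
        # most with the categories of the already-resolved concepts.
        best = max(
            candidates,
            key=lambda c: sum(context[cat] for cat in CONCEPT_TO_CATEGORIES[c]),
        )
        resolved.append(best)
    return resolved


def map_concepts_to_categories(concepts):
    """Stage 2: collect the categories of every resolved concept."""
    return [cat for c in concepts for cat in CONCEPT_TO_CATEGORIES[c]]


def frequency_idf(docs):
    """tf.idf-style weights over any item lists (concepts or categories)."""
    n = len(docs)
    df = Counter(item for doc in docs for item in set(doc))
    return [
        {item: count * math.log(n / df[item]) for item, count in Counter(doc).items()}
        for doc in docs
    ]


# Usage: two toy weblog posts, already tokenized into terms.
blogs = [["python", "compiler"], ["python", "snake", "venom"]]
concepts = [map_terms_to_concepts(b) for b in blogs]
categories = [map_concepts_to_categories(c) for c in concepts]
conf_idf = frequency_idf(concepts)    # concept-level weights (assumed form)
catf_idf = frequency_idf(categories)  # category-level weights (assumed form)
```

In this sketch the ambiguous term "python" resolves differently in each post because the surrounding unambiguous terms point to different categories, which is the kind of disambiguation the first stage is meant to perform.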

Saadat M. Alhashmi | Ramesh Kumar Ayyasamy | Bashar Tahayna | Eu-Gene Siew | Simon Egerton
