Document-Document similarity matrix and Multiple-Kernel Fuzzy C-Means Algorithm-based web document clustering for information retrieval

With the continuous development of the World Wide Web, web databases are growing massively, and automatically grouping web documents so that information can be retrieved easily poses a new challenge for researchers. The literature presents different web document clustering algorithms that are useful for information retrieval. In this work, web document clustering based on a Document-Document (D-D) similarity matrix and the Multiple-Kernel Fuzzy C-Means algorithm is developed for information retrieval. First, the web documents are read and initial pre-processing is applied to extract the important words. Then, the feature space is constructed from the keywords and their frequencies. Subsequently, the document-to-document similarity matrix is constructed using a similarity measure called the semantic retrieval measure (SR). The measure considers four criteria: the probability of occurrence in the document, the probability of occurrence in the first document, the probability of occurrence in the second document, and the probability of occurrence in both synonym sets. Based on this measure, the D-D matrix is computed, and the final grouping is performed on it using the Multiple-Kernel Fuzzy C-Means algorithm. Experiments are carried out on 100 web documents, and the results are evaluated in terms of accuracy and entropy.
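The pipeline described above (keyword-frequency features, a document-to-document similarity matrix, then fuzzy grouping of its rows) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the SR formula or the kernel settings, so cosine similarity stands in for SR and plain single-kernel fuzzy c-means stands in for the Multiple-Kernel Fuzzy C-Means algorithm. The stop-word list, function names, and toy documents are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical stop-word list; the paper's actual pre-processing is not specified.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def term_frequencies(text):
    """Pre-process a document and return its keyword -> frequency map."""
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two frequency maps (stand-in for the SR measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dd_matrix(docs):
    """Build the document-to-document similarity matrix."""
    feats = [term_frequencies(d) for d in docs]
    return [[cosine(fi, fj) for fj in feats] for fi in feats]

def fuzzy_c_means(rows, c=2, m=2.0, iters=50, seed=0):
    """Plain fuzzy c-means on the rows of a matrix (stand-in for MKFCM).

    Returns a membership matrix u, where u[i][k] is the degree to which
    document i belongs to cluster k; each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        r = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(r)
        u.append([x / s for x in r])
    for _ in range(iters):
        # Cluster centres: means of the rows weighted by u^m.
        centres = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            tot = sum(w)
            centres.append(
                [sum(w[i] * rows[i][d] for i in range(n)) / tot for d in range(dim)]
            )
        # Membership update from relative distances to the centres.
        for i in range(n):
            dist = [max(math.dist(rows[i], centres[k]), 1e-12) for k in range(c)]
            for k in range(c):
                u[i][k] = 1.0 / sum(
                    (dist[k] / dist[j]) ** (2.0 / (m - 1.0)) for j in range(c)
                )
    return u

# Toy corpus (made up for illustration): two documents per topic.
docs = [
    "cats chase mice cats sleep",
    "cats sleep mice hide",
    "stocks rise markets rally stocks",
    "markets rally stocks fall",
]
sim = dd_matrix(docs)
memberships = fuzzy_c_means(sim, c=2)
labels = [max(range(2), key=lambda k: row[k]) for row in memberships]
```

Here each document is represented by its row of similarities to all other documents, so documents that resemble the same neighbours end up with high membership in the same cluster. Replacing `cosine` with an SR implementation, or the Euclidean distance with a combination of kernel-induced distances, would bring the sketch closer to the method the abstract describes.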
