Spectral Clustering Wikipedia Keyword-Based Search Results

This paper presents an application of spectral clustering algorithms to grouping Wikipedia search results. The main contribution is a representation method for Wikipedia articles that combines words and links, which we use to categorize search results in this repository. We evaluate the proposed approach with Principal Component Analysis and show, on test data, how using the cosine transformation to create combined representations influences data variability. On sample test datasets we also show that the combined representation improves data separation, which in turn improves the overall categorization results. We review the three main spectral clustering methods and test their usability for text categorization, comparing them using external validation criteria with standard clustering quality measures. A discussion of the descriptiveness of the evaluation measures, together with the experiments performed on the test datasets, allows us to select the one spectral clustering algorithm that has been implemented in our system. We give a brief description of the architecture of the system, which groups online Wikipedia articles retrieved with user-specified keywords. Using the system, we show how clustering increases information retrieval effectiveness for the Wikipedia data repository.
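To make the idea of a combined word-and-link representation concrete, the following minimal sketch blends cosine similarities computed separately on word vectors and on link vectors. It is an illustration under stated assumptions, not the method used in the paper: the toy articles, the function names `cosine` and `combined_similarity`, and the blending weight `alpha` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(articles, alpha=0.5):
    """Blend word-based and link-based cosine similarity.

    alpha is a hypothetical weight: 1.0 uses words only, 0.0 links only.
    """
    n = len(articles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s_words = cosine(articles[i]["words"], articles[j]["words"])
            s_links = cosine(articles[i]["links"], articles[j]["links"])
            sim[i][j] = alpha * s_words + (1.0 - alpha) * s_links
    return sim


# Toy data (made up): two programming articles and one about the animal.
articles = [
    {"words": {"python": 2, "language": 1},
     "links": {"Programming": 1, "Software": 1}},
    {"words": {"java": 2, "language": 1},
     "links": {"Programming": 1, "Software": 1}},
    {"words": {"python": 2, "snake": 1},
     "links": {"Reptile": 1, "Snake": 1}},
]

S_words_only = combined_similarity(articles, alpha=1.0)
S = combined_similarity(articles, alpha=0.5)

# With words alone, article 0 looks closer to article 2 (shared "python");
# with links blended in, article 0 is closer to article 1.
print(S_words_only[0][1] < S_words_only[0][2])
print(S[0][1] > S[0][2])
```

A matrix such as `S` could serve as the affinity matrix passed to a spectral clustering algorithm. How the paper actually weights and combines the two sources is not specified in the abstract.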
