DIMENSIONALITY REDUCTION TECHNIQUES FOR SEARCH RESULTS CLUSTERING

Search results clustering attempts to automatically organise the linear list of document references returned by a search engine into a set of meaningful thematic categories. Such a clustered view helps users identify documents of interest more quickly. One search results clustering method is the description-comes-first approach, in which a dimensionality reduction technique is first used to identify a number of meaningful group labels, and these labels then determine the content of the actual clusters. The aim of this project was to compare how three different dimensionality reduction techniques perform as part of the description-comes-first method, in terms of both clustering quality and computational efficiency. The evaluation was based on the standard merge-then-cluster model, for which we used the Open Directory Project web catalogue as a source of human-clustered document references. During the project we implemented a number of dimensionality reduction techniques in Java and integrated them with our description-comes-first search results clustering algorithm. We also created a simple benchmarking application, which we used to gather data for comparison and analysis. Finally, we identified the one dimensionality reduction technique that performed best in terms of both clustering quality and computational efficiency.
