Dynamicity vs. effectiveness: studying online clustering for scatter/gather

We proposed and implemented a novel clustering algorithm called LAIR2, which has constant average running time for on-the-fly Scatter/Gather browsing [4]. Our experiments showed that the LAIR2 online clustering algorithm, running on a single processor, was several hundred times faster than a parallel Buckshot algorithm running on multiple processors [11]. This paper reports on a study that examined the effectiveness of the LAIR2 algorithm in terms of clustering quality and its impact on retrieval performance. We conducted a user study with 24 subjects to evaluate on-the-fly LAIR2 clustering in Scatter/Gather search tasks, comparing its performance to that of the Buckshot algorithm, a classic method for Scatter/Gather browsing [4]. Results showed significant differences in subjective perceptions of clustering quality: subjects perceived that the LAIR2 algorithm produced significantly better clusters than the Buckshot method did. Subjects also felt that the LAIR2 system required less effort to complete the tasks and was more effective in helping them with the tasks. Interesting patterns also emerged from subjects' comments in the final open-ended questionnaire. We discuss implications and future research.

[1] Carolyn J. Crouch, et al. The use of cluster hierarchies in hypertext information retrieval, 1989, Hypertext.

[2] David R. Karger, et al. Scatter/Gather: a cluster-based approach to browsing large document collections, 1992, SIGIR '92.

[3] David R. Karger, et al. Constant interaction-time scatter/gather browsing of very large document collections, 1993, SIGIR.

[4] David R. Karger, et al. Scatter/Gather as a Tool for the Navigation of Retrieval Results, 1995.

[5] Marti A. Hearst, et al. Reexamining the cluster hypothesis: scatter/gather on retrieval results, 1996, SIGIR '96.

[6] Marti A. Hearst, et al. Scatter/gather browsing communicates the topic structure of a very large text collection, 1996, CHI.

[7] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[8] George Karypis, et al. A Comparison of Document Clustering Techniques, 2000.

[9] Anthony K. H. Tung, et al. Spatial clustering methods in data mining: A survey, 2001.

[10] James P. Callan, et al. Experiments Using the Lemur Toolkit, 2001, TREC.

[11] Robert Villa, et al. The effectiveness of query-specific hierarchic clustering in information retrieval, 2002, Inf. Process. Manag.

[12] Ophir Frieder, et al. Parallelizing the buckshot algorithm for efficient document clustering, 2002, CIKM '02.

[13] Wei-Ying Ma, et al. An Evaluation on Feature Selection for Text Clustering, 2003, ICML.

[14] James Allan, et al. HARD Track Overview in TREC 2003: High Accuracy Retrieval from Documents, 2003, TREC.

[15] Tao Li, et al. Document clustering via adaptive subspace iteration, 2004, SIGIR '04.

[16] Ramayya Krishnan, et al. Incremental hierarchical clustering of text documents, 2006, CIKM '06.

[17] Weimao Ke, et al. Toward responsive visualization services for scatter/gather browsing, 2008, ASIST.