Paper Information - Random Indexing K-tree

Random Indexing K-tree

Random Indexing (RI) K-tree combines two algorithms, Random Indexing and K-tree, for document clustering. Document clustering poses many large-scale problems, and RI K-tree scales well to large inputs because of its low complexity. It also has properties that are useful for managing a changing collection, and it resolves earlier problems with sparse document vectors in K-tree. The paper defines, explains and motivates the algorithms and data structures, and describes the specific modifications made to K-tree for use with RI. Experiments were run to measure cluster quality, and the results indicate that RI K-tree improves document cluster quality over the original K-tree algorithm.
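The abstract does not spell out how Random Indexing builds document vectors, so the following is only a minimal sketch of the general technique as it is commonly described: each term gets a sparse random index vector with a few +1/-1 entries, and a document is represented by the frequency-weighted sum of the index vectors of its terms. The function names, the dimensionality `dim`, the number of non-zero entries `nonzeros`, and the seeding scheme are all illustrative assumptions, not values taken from the paper, and the K-tree step is not shown.

```python
import math
import random
from collections import Counter


def index_vector(term, dim=64, nonzeros=4, seed=0):
    """Sparse ternary index vector for a term (hypothetical helper).

    Half of the non-zero entries are +1 and half are -1; all others are 0.
    Seeding the generator with the term makes the vector reproducible.
    """
    rng = random.Random(f"{seed}:{term}")
    positions = rng.sample(range(dim), nonzeros)
    vec = [0] * dim
    for i, pos in enumerate(positions):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec


def document_vector(tokens, dim=64, nonzeros=4, seed=0):
    """Document vector: term-frequency-weighted sum of term index vectors."""
    doc = [0.0] * dim
    for term, count in Counter(tokens).items():
        for i, value in enumerate(index_vector(term, dim, nonzeros, seed)):
            doc[i] += count * value
    return doc


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    d1 = document_vector("random indexing for document clustering".split())
    d2 = document_vector("document clustering with a tree".split())
    print(len(d1), cosine(d1, d2))
```

Every document ends up with the same fixed length `dim` regardless of vocabulary size, so vectors of this kind can be handed to any vector-based clustering algorithm. Plain term frequency is used here for simplicity; a weighting scheme such as TF-IDF could be substituted.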

Shlomo Geva | Lance De Vine | Christopher M. De Vries
