Random Indexing K-tree

Random Indexing (RI) K-tree combines two algorithms, Random Indexing and K-tree, for clustering. Document clustering poses many large-scale problems, and RI K-tree scales well to large inputs because of its low complexity. It also has properties that are useful for managing a changing collection, and it resolves earlier problems with sparse document vectors in K-tree. The algorithms and data structures are defined, explained and motivated, along with the specific modifications made to K-tree for use with RI. Experiments measuring cluster quality indicate that RI K-tree improves document cluster quality over the original K-tree algorithm.
