KDX: an indexer for support vector machines

Support vector machines (SVMs) have been adopted by many data mining and information-retrieval applications for learning a mining or query concept, and then retrieving the "top-k" best matches to the concept. However, when the data set is large, naively scanning the entire data set to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and, thus, improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings (e.g., gamma and sigma) without performance compromise. Through theoretical analysis and empirical studies on a wide variety of data sets, we demonstrate KDX to be very effective. An earlier version of this paper appeared in the 2005 SIAM International Conference on Data Mining. This version differs from the previous submission in providing a detailed cost analysis under different scenarios, specifically designed to meet the varying needs of accuracy, speed, and space requirements, developing an approach for insertion and deletion of instances, presenting the specific computations as well as the geometric properties used in performing the same, and providing detailed algorithms for each of the operations necessary to create and use the index structure

[1]  Daphne Koller,et al.  Support Vector Machine Active Learning with Applications to Text Classification , 2000, J. Mach. Learn. Res..

[2]  Bernhard Schölkopf,et al.  Extracting Support Data for a Given Task , 1995, KDD.

[3]  Christos Faloutsos,et al.  The TV-tree: An index structure for high-dimensional data , 1994, The VLDB Journal.

[4]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[5]  Christopher J. C. Burges,et al.  Geometry and invariance in kernel based methods , 1999 .

[6]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[7]  Bernhard Schölkopf,et al.  Support Vector Method for Novelty Detection , 1999, NIPS.

[8]  Daphne Koller,et al.  Support Vector Machine Active Learning with Application sto Text Classification , 2000, ICML.

[9]  Edward Y. Chang,et al.  Support vector machine active learning for image retrieval , 2001, MULTIMEDIA '01.

[10]  Christian Böhm,et al.  Independent quantization: an index compression technique for high-dimensional data spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[11]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[12]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[13]  Edward Y. Chang,et al.  CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines , 2003, IEEE Trans. Circuits Syst. Video Technol..

[14]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[15]  Edward Y. Chang,et al.  Exploiting Geometry for Support Vector Machine Indexing , 2005, SDM.

[16]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[17]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[18]  Jason Weston,et al.  Gene functional classification from heterogeneous data , 2001, RECOMB.

[19]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[20]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[21]  Bernhard Schölkopf,et al.  Improving the Accuracy and Speed of Support Vector Machines , 1996, NIPS.

[22]  Hyunsoo Kim,et al.  Dimension Reduction in Text Classification with Support Vector Machines , 2005, J. Mach. Learn. Res..

[23]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[24]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Harris Drucker,et al.  Support vector machines for spam categorization , 1999, IEEE Trans. Neural Networks.

[26]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[27]  Michael S. Brown,et al.  Support Vector Machine Classi cation ofMicroarray Gene Expression DataUCSC-CRL-99-09 , 1999 .

[28]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[29]  Nello Cristianini,et al.  Support vector machine classification and validation of cancer tissue samples using microarray expression data , 2000, Bioinform..