Comparing representative selection strategies for dissimilarity representations

Many computational intelligence techniques in current use scale poorly with data type or computational load, so selecting the right dimensionality reduction technique for the data is essential. A dimensionality reduction technique called representative dissimilarity creates an embedded space in which large spaces of complex patterns are simplified to a fixed-dimensional Euclidean space of points. The only current suggestions as to how the representatives should be selected are principal component analysis, projection pursuit, and factor analysis. Several alternative representative selection strategies are proposed and empirically evaluated on a set of term vectors constructed from HTML documents. The results indicate that a representative dissimilarity representation with at least 50 representatives can achieve a significant increase in classification speed with a minimal sacrifice in accuracy. When the representatives are selected randomly, the time required to create the embedded space is also significantly reduced, again with a small penalty in accuracy. © 2006 Wiley Periodicals, Inc. Int J Int Syst 21: 1093–1109, 2006.
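
As a rough illustration of the idea summarized above, the following Python sketch embeds each pattern as its vector of dissimilarities to a small set of randomly chosen representatives. It is a minimal sketch under stated assumptions and not the authors' implementation: the function names, the toy term vectors, and the choice of cosine distance as the dissimilarity measure are all hypothetical, since the abstract does not specify them.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance between two equal-length term vectors.

    Assumption: the abstract does not name a dissimilarity measure;
    cosine distance is used here only because it is common for term vectors.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: an all-zero vector is maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def select_random_representatives(patterns, k, seed=0):
    """Pick k representatives uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(patterns, k)


def embed(pattern, representatives, distance=cosine_distance):
    """Map a pattern to a point whose i-th coordinate is its
    dissimilarity to the i-th representative."""
    return [distance(pattern, rep) for rep in representatives]


# Toy term-frequency vectors (hypothetical data, for illustration only).
docs = [
    [3, 0, 1, 0],
    [2, 0, 2, 0],
    [0, 4, 0, 1],
    [0, 3, 0, 2],
    [1, 1, 1, 1],
]

k = 2
reps = select_random_representatives(docs, k)
embedded = [embed(d, reps) for d in docs]
```

Each document becomes a point with exactly `k` coordinates regardless of the original vocabulary size, so a downstream classifier works in a fixed-dimensional space. Random selection needs no pass over the data beyond sampling, which is consistent with the reduced embedding time the abstract reports for that strategy.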
