Dimensionality Reduction for Categorical Data

Categorical attributes are those that take values from a discrete set, e.g., colours. This work is about compressing vectors over categorical attributes into low-dimensional discrete vectors. Current hash-based methods for this task provide no guarantee on the Hamming distances between the compressed representations. Here we present FSketch, a method to create sketches for sparse categorical data, together with an estimator that recovers the pairwise Hamming distances among the uncompressed data from their sketches alone. We claim that these sketches can be used in the usual data mining tasks in place of the original data without compromising the quality of the task. To that end, we ensure that the sketches are themselves categorical and sparse, and that the Hamming distance estimates are reasonably precise. Both the sketch construction and the Hamming distance estimation algorithms require just a single pass; furthermore, changes to a data point can be incorporated into its sketch efficiently. The compressibility depends upon how sparse the data is and is independent of the original dimension, making our algorithm attractive for many real-life scenarios. Our claims are backed by rigorous theoretical analysis of the properties of FSketch and supplemented by extensive comparative evaluations with related algorithms on several real-world datasets. We show that FSketch is significantly faster, and that the accuracy obtained by using its sketches is among the best for standard unsupervised tasks, as measured by RMSE, clustering quality, and similarity search accuracy.
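To make the description above concrete, the following is a minimal Python sketch of how such a scheme can work. The random map pi, the coefficients rho, the prime p, and the estimator's derivation are illustrative assumptions consistent with the abstract (single-pass construction, O(1) updates, distance estimation from sketches alone), not the paper's exact construction.

```python
import numpy as np

def make_params(n, d, p=19, seed=0):
    """Draw a random map pi: [n] -> [d] and random nonzero coefficients
    rho_i in Z_p. Both are fixed once and shared across all data points."""
    rng = np.random.default_rng(seed)
    pi = rng.integers(0, d, size=n)    # destination sketch coordinate of each input dim
    rho = rng.integers(1, p, size=n)   # nonzero multiplier of each input dim
    return pi, rho

def fsketch(nz_idx, nz_val, d, pi, rho, p=19):
    """Single-pass sketch of a sparse categorical vector given by its
    nonzero indices and values. Runs in O(nnz) time, independent of n."""
    s = np.zeros(d, dtype=np.int64)
    for i, v in zip(nz_idx, nz_val):
        s[pi[i]] = (s[pi[i]] + rho[i] * v) % p
    return s

def update_sketch(s, i, old, new, pi, rho, p=19):
    """Fold the change x_i: old -> new into an existing sketch in O(1),
    with no need to recompute from scratch."""
    s[pi[i]] = (s[pi[i]] + rho[i] * (new - old)) % p
    return s

def estimate_hamming(s1, s2, d, p=19):
    """Estimate the Hamming distance between the original vectors from the
    count h of differing sketch coordinates. Under a simple independence
    model, a sketch coordinate differs with probability roughly
    (1 - (1 - 1/d)^f) * (1 - 1/p) when the originals differ in f positions;
    solving for f yields the estimator below."""
    h = np.count_nonzero(s1 != s2)
    frac = min(h * p / (d * (p - 1)), 1.0 - 1e-9)  # clamp so the log stays finite
    return np.log1p(-frac) / np.log1p(-1.0 / d)

if __name__ == "__main__":
    pi, rho = make_params(n=10_000, d=256)
    s1 = fsketch([3, 42, 999], [2, 1, 3], d=256, pi=pi, rho=rho)
    s2 = fsketch([3, 42, 999], [2, 5, 3], d=256, pi=pi, rho=rho)
    print(estimate_hamming(s1, s2, d=256))  # approximately 1 for this pair
```

Note how the mod-p linearity of each sketch coordinate is what makes single-pass construction and constant-time updates possible, and why the cost scales with the number of nonzero entries rather than the original dimension.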
