Exploring New Forms of Random Projections for Prediction and Dimensionality Reduction in Big-Data Regimes

The story of this work is dimensionality reduction. Dimensionality reduction is a family of methods that take as input a point-set P of n points in R^d, where d is typically large, and attempt to find a lower-dimensional representation of that dataset in order to ease the burden of processing for downstream algorithms. In today's landscape of machine learning, researchers and practitioners work with datasets that have a very large number of samples and/or high-dimensional samples, so dimensionality reduction is applied as a pre-processing technique primarily to overcome the curse of dimensionality. Generally, dimensionality reduction reduces the time and storage required to process the point-set, removes multi-collinearity and redundancies in the dataset where different features depend on one another, and can enable simple 2-D and 3-D visualizations that make the relationships in the data easier for humans to comprehend.

Dimensionality reduction methods come in many shapes and sizes. Principal Component Analysis (PCA), Multi-dimensional Scaling, Isomap, and Locally Linear Embedding are among the most commonly used members of this family of algorithms. However, the choice of method proves critical in many applications: there is no one-size-fits-all solution, and special care must be taken for different datasets and tasks. Furthermore, these popular methods are data-dependent and commonly rely on computing either the kernel (Gram) matrix or the covariance matrix of the dataset. These matrices grow with the number of samples and the number of data dimensions, respectively, and are consequently poor choices in today's landscape of big-data applications.

It is therefore pertinent to develop new dimensionality reduction methods that can be applied efficiently to large and high-dimensional datasets, by either reducing the dependency on the data or side-stepping it altogether, while performing on par with, or better than, traditional methods such as PCA. To achieve this goal, we turn to a simple and powerful tool: random projections. Random projections are a simple, efficient, and data-independent method for stably embedding a point-set P of n points from R^d into R^k, where d is typically large and k is on the order of log n. Random projections have a long and successful history in the dimensionality reduction literature. In this work we build on the ideas of random projection theory and extend the framework into a powerful new setup of random projections for large, high-dimensional datasets, with performance comparable to state-of-the-art data-dependent and nonlinear methods. Furthermore, we study the use of random projections in domains beyond dimensionality reduction, including prediction.
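To make the central idea concrete, the sketch below illustrates a generic data-independent random projection in the Johnson-Lindenstrauss spirit: a Gaussian matrix, drawn without looking at the data, maps n points from R^d down to roughly k = O(log n / eps^2) dimensions while approximately preserving pairwise distances. This is a minimal illustration under assumed values of n, d, and eps with a plain Gaussian projection matrix; it is not the specific constructions developed in this work.

```python
# Minimal sketch of a Johnson-Lindenstrauss-style random projection.
# All sizes (n, d), the distortion parameter eps, and the plain Gaussian
# projection matrix are illustrative assumptions for this sketch only.
import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 5000        # many points in a high-dimensional space
eps = 0.25               # target pairwise-distance distortion, (1 +/- eps)

# JL-style target dimension: k grows like log(n) / eps^2 and is independent of d
# (Dasgupta-Gupta form of the bound).
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))

X = rng.standard_normal((n, d))                 # stand-in for the point-set P
R = rng.standard_normal((d, k)) / np.sqrt(k)    # data-independent projection matrix
X_low = X @ R                                   # n x k embedding, with k << d

# Check that the distance between one random pair of points is preserved
# up to roughly a (1 +/- eps) factor.
i, j = rng.integers(0, n, size=2)
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(X_low[i] - X_low[j])
print(f"k = {k}, distance ratio = {proj / orig:.3f}")
```

Because the projection matrix is drawn independently of the data, no Gram or covariance matrix is ever formed, and the same d x k matrix can be reused across batches or streams, which is precisely the property that makes this family of methods attractive in big-data regimes.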
