Random Projection Through the Lens of Data Complexity Indicators

Random projection (RP) is a simple and efficient tool for dimensionality reduction and has been applied in many machine learning algorithms. Theoretical guarantees ensure that the method approximately preserves pairwise Euclidean distances with high probability. However, worst-case theoretical bounds on the required target dimension are not tight enough for direct practical use, so in practice much lower target dimensions are often chosen, especially for large data sets. Theory suggests that the success of such aggressive compression may be explained by low-complexity structure in the underlying data support, but this structure is typically unknown in practice. Here we investigate the use of a previously proposed battery of data complexity indicators (DCIs) to gain insight into how random projection changes the data complexity of classification problems. Our experimental results show that complexity increases with compression, but in most cases data can be projected onto fairly low dimensions while maintaining nearly the same level of data complexity as the original data. This may offer some guidance for the choice of the target dimension in practice, although it should be interpreted with caution.
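To make the distance-preservation property concrete: by the Johnson-Lindenstrauss lemma, a random projection to roughly $k = O(\varepsilon^{-2}\log n)$ dimensions preserves all pairwise distances among $n$ points up to a factor of $(1 \pm \varepsilon)$ with high probability. The following is a minimal sketch, not code from the paper, illustrating this effect with a Gaussian random projection; the dimensions `n`, `d`, and `k` below are illustrative choices, not values used in the experiments.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 100              # n points in d dimensions, projected to k

X = rng.standard_normal((n, d))       # placeholder data; any point cloud works

# Gaussian random projection matrix with entries scaled by 1/sqrt(k),
# so squared Euclidean norms are preserved in expectation.
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

# Ratio of projected to original pairwise distances; values near 1.0
# indicate the geometry survived the compression.
ratios = pdist(Y) / pdist(X)
print(f"distance ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
```

Running this typically yields ratios concentrated near 1.0, even though the target dimension is only a tenth of the original, which is the empirical behavior the abstract alludes to.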