On Submodular Set Cover Problems for Near-Optimal Online Kernel Basis Selection

Non-parametric function approximators provide a principled way to fit nonlinear statistical models while affording formal performance guarantees. However, their complexity drawbacks are well-understood: they define a statistical representation whose complexity scales with the sample size through the fact that they retain all past samples. In the case of streaming data, this complexity may grow unbounded. One is faced with the question of how to suitably trade off representational complexity with statistical accuracy, which may be addressed with various approximation methods. In this work, we formalize that greedy-based approximations, under suitably chosen compression statistics, can admit near-optimal representations. The key driver of this result is a novel connection between the reproducing kernel Hilbert Space (RKHS) norm and the log-determinant of the kernel matrix, which has been shown to be a submodular set function of a collection of points. This allows us to design a constructive variant of a greedy subspace projections in [1], [2] according to a submodular set cover (SSC) problem, which provably picks at most logarithmically more elements than the optimal one. We validate our constructive approach by doing simulation on real ocean data from the Gulf of Mexico [3].