Nonparametric methods for learning and detecting multivariate statistical dissimilarity. (Méthodes non-paramétriques pour l'apprentissage et la détection de dissimilarité statistique multivariée)

In this thesis, we study problems related to learning and detecting multivariate statistical dissimilarity, which are of paramount importance for many statistical learning methods now used in an increasing number of fields. The thesis makes three contributions. The first introduces a notion of multivariate nonparametric effect size that sheds light on the nature of the dissimilarity detected between two datasets. Our two-step method first decomposes a dissimilarity measure (the Jensen-Shannon divergence) in order to localize the dissimilarity in the data embedding space, and then aggregates points of high discrepancy that lie in spatial proximity into clusters. The second contribution presents the first sequential nonparametric two-sample test. That is, instead of being given two sets of observations of fixed size, the test processes observations one at a time and can be stopped as soon as strong enough evidence has been found, yielding a more flexible procedure while retaining guaranteed type I error control. Additionally, under certain conditions, the probability of type II error vanishes as the number of observations tends to infinity. The third contribution consists of a sequential change detection test with type I error guarantees, based on two sliding windows on which a two-sample test is performed. Our test has a controlled memory footprint and, unlike state-of-the-art methods that also provide type I error control, has constant time complexity per observation, which makes it suitable for streaming data.
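
To make the sliding-window idea of the third contribution concrete, the following is a minimal sketch and not the procedure developed in the thesis. It assumes one-dimensional observations, uses a plain difference of window means as a stand-in for the two-sample test, and takes a hypothetical user-chosen `threshold` that carries no type I error guarantee. The class name `SlidingWindowMeanShift` and its parameters are illustrative only. What the sketch does show is how two fixed-size windows with running sums give a bounded memory footprint and a constant cost per observation.

```python
from collections import deque


class SlidingWindowMeanShift:
    """Toy change detector over two adjacent sliding windows.

    Assumption: the statistic is a plain difference of window means, a
    stand-in for a real two-sample test, and `threshold` is a hypothetical
    user-chosen value with no type I error guarantee.
    """

    def __init__(self, window_size, threshold):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = deque()  # older window
        self.current = deque()    # most recent window
        self.ref_sum = 0.0        # running sum of the reference window
        self.cur_sum = 0.0        # running sum of the current window

    def update(self, x):
        """Consume one observation in O(1); return True if a change is flagged."""
        self.current.append(x)
        self.cur_sum += x
        if len(self.current) > self.window_size:
            # The oldest point of the recent window moves into the reference window.
            moved = self.current.popleft()
            self.cur_sum -= moved
            self.reference.append(moved)
            self.ref_sum += moved
            if len(self.reference) > self.window_size:
                # Drop the oldest reference point so memory stays bounded.
                self.ref_sum -= self.reference.popleft()
        if len(self.reference) < self.window_size:
            return False  # both windows are not full yet
        gap = abs(self.cur_sum - self.ref_sum) / self.window_size
        return gap > self.threshold


# Usage: a stream whose mean jumps from 0 to 3 at index 20.
detector = SlidingWindowMeanShift(window_size=5, threshold=1.0)
stream = [0.0] * 20 + [3.0] * 20
alarms = [i for i, x in enumerate(stream) if detector.update(x)]
```

Each call to `update` touches at most two points per window, so the cost per observation does not depend on the length of the stream. A real test would replace the mean difference with a nonparametric two-sample statistic and calibrate the threshold to control the false alarm rate.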