HIGH-D DATA VISUALIZATION METHODS VIA PROBABILISTIC PRINCIPAL SURFACES FOR DATA MINING APPLICATIONS

One of the central problems in pattern recognition is the estimation of the probability density function (pdf) of the input data, i.e., the construction of a model of a probability distribution from a finite sample of data drawn from that distribution. Probabilistic Principal Surfaces (hereinafter PPS) is a nonlinear latent variable model that provides a way to accomplish pdf estimation and has two aspects useful for a wide range of data mining applications: (1) visualization of high-dimensional data and (2) their classification. PPS generates a nonlinear manifold passing through the data points, defined in terms of a number of latent variables and of a nonlinear mapping from latent space to data space. Depending on the dimensionality of the latent space (usually at most 3-dimensional), one obtains 1-D, 2-D or 3-D manifolds. Among the 3-D manifolds, PPS makes it possible to build a spherical manifold in which the latent variables are uniformly arranged on a unit sphere. This particular form of the manifold provides a very effective tool for reducing the problems deriving from the curse of dimensionality as the data dimension increases. In this paper we concentrate on PPS used as a visualization tool, proposing a number of plot options and showing its effectiveness on two complex astronomical data sets.
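
As an illustration of what "latent variables uniformly arranged on a unit sphere" can look like in practice, the following minimal sketch places latent nodes on the unit sphere. The abstract does not say how the nodes are positioned, so the Fibonacci (golden-angle) lattice used here is an assumption chosen for simplicity, and the function name `fibonacci_sphere` and the node count of 200 are hypothetical. It yields an approximately, not exactly, uniform arrangement.

```python
import math


def fibonacci_sphere(n_nodes):
    """Return n_nodes (x, y, z) tuples spread roughly evenly on the unit sphere.

    Assumption: a Fibonacci lattice stands in for the node layout; the
    source does not specify how its latent nodes are arranged.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    nodes = []
    for i in range(n_nodes):
        # Heights are equally spaced in z, which gives equal-area bands.
        z = 1.0 - 2.0 * (i + 0.5) / n_nodes
        # Radius of the horizontal circle at height z.
        r = math.sqrt(1.0 - z * z)
        # Successive nodes are rotated by the golden angle around the z axis.
        theta = golden_angle * i
        nodes.append((r * math.cos(theta), r * math.sin(theta), z))
    return nodes


# Hypothetical node count, for illustration only.
latent_nodes = fibonacci_sphere(200)
```

Each tuple in `latent_nodes` is a 3-D latent coordinate of unit norm; in a PPS-style model, such nodes would then be carried into data space by the nonlinear mapping.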