Multiple Linear Regression for Histogram Data using Least Squares of Quantile Functions: a Two-components model

Histograms are commonly used to summarize observed data, and they can be regarded as nonparametric estimates of probability distributions. Symbolic Data Analysis formalized the concept of a histogram symbolic variable, that is, a variable that describes statistical units by histograms instead of single values. In this paper we present a linear regression model for multivariate histogram variables. We use a Least Squares estimation method in which the sum of squared errors is defined according to the ℓ2 Wasserstein metric between the observed and the predicted histogram data. Consistently with the ℓ2 Wasserstein metric, we solve the Least Squares computational problem by introducing a suitable inner product between two vectors of histogram data. Finally, measures of goodness of fit are discussed, and an application to real data shows some interpretative advantages of the proposed method.
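As an illustration of the error measure named above, the squared ℓ2 Wasserstein distance between two one-dimensional distributions can be written as the integral over (0, 1) of the squared difference of their quantile functions. The following is a minimal sketch, not the authors' implementation. It assumes that each histogram is given by bin edges and strictly positive bin weights, and that values are uniformly distributed within each bin, so that the quantile function is piecewise linear. The function names are hypothetical.

```python
import numpy as np


def _cum_weights(weights):
    """Cumulative relative frequencies, starting at 0 and ending at exactly 1."""
    c = np.cumsum(np.asarray(weights, dtype=float))
    return np.concatenate(([0.0], c / c[-1]))


def wasserstein2_sq(edges_x, weights_x, edges_y, weights_y):
    """Squared l2 Wasserstein distance between two histograms.

    Assumption: uniform density inside each bin and strictly positive
    weights, so each quantile function is piecewise linear and increasing.
    """
    cx = _cum_weights(weights_x)
    cy = _cum_weights(weights_y)
    # On the merged grid of cumulative weights both quantile functions
    # are linear between consecutive points.
    t = np.union1d(cx, cy)
    qx = np.interp(t, cx, np.asarray(edges_x, dtype=float))
    qy = np.interp(t, cy, np.asarray(edges_y, dtype=float))
    d = qx - qy
    a, b = d[:-1], d[1:]
    # Exact integral of the square of a linear function on each segment.
    return float(np.sum(np.diff(t) * (a * a + a * b + b * b) / 3.0))


# Uniform on [0, 1] versus uniform on [2, 3]: a pure shift by 2.
shifted = wasserstein2_sq([0, 1], [1], [2, 3], [1])
# Uniform on [0, 1] versus uniform on [0, 2]: a pure change of scale.
scaled = wasserstein2_sq([0, 1], [1], [0, 2], [1])
```

Under these assumptions a pure shift contributes the squared difference in location, while the second call isolates a difference in spread. In a Least Squares setting, the sum of such squared distances between observed and predicted histograms would play the role of the sum of squared errors.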
