A new linear regression model for histogram-valued variables

In classical data analysis, each individual takes a single “value” on each descriptive variable. Symbolic Data Analysis ([Bock and Diday (2000)], [Billard and Diday (2007)]) generalizes this framework by allowing each individual, or class of individuals, to take on each variable a finite set of values (quantitative multi-valued variables), a finite set of categories (qualitative multi-valued variables), an interval (interval-valued variables) or a distribution (modal-valued variables). A special case of the latter arises when, for all observations of the modal-valued variable, the distribution is given by the probabilities/frequencies of values occurring in certain ranges; the variable is then said to be histogram-valued. Interval-valued variables may be seen as a particular case of histogram-valued variables in which every observation has a single interval with probability/frequency one. The variable Y is a random histogram-valued variable if to each observation j corresponds a probability or frequency distribution Y(j) that can be represented by the histogram ([Bock and Diday (2000)]):