Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 1. Theory and simple chemometric applications.

The similarity/diversity of objects has been widely studied in many research fields, and a number of distance measures have been proposed to estimate the diversity between objects. However, little attention has been paid to how diverse the configurations of objects are in two different multivariate spaces. Because computerisation and automation now make large amounts of information available, a system can be described in several different ways, and methods for comparing these different viewpoints are therefore required. Such methods can be usefully applied, for instance, in Quantitative Structure-Activity Relationship (QSAR) studies. In this field, several thousand molecular descriptors have been proposed in the literature, and different selections of descriptors define different chemical spaces that need to be compared. Moreover, variable selection techniques such as Genetic Algorithms, Simulated Annealing, and Tabu Search are widely used to process the available information in order to select optimal QSAR models. When more than one optimal model is found, the problem arises of how to compare these models to determine whether they are really diverse or are based on descriptors that explain almost the same information. In this paper, novel indices are proposed to measure the similarity/diversity between pairs of data sets with the aid of the cross-correlation matrix of their variables.
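
To make the starting point concrete, the following is a minimal sketch of a variable cross-correlation matrix between two data sets that describe the same objects. It is an illustration only and does not compute the indices proposed in this paper. The function names (`column_stats`, `cross_correlation`) and the toy data are hypothetical, and the sketch assumes plain Pearson correlations, the same objects in the same row order in both blocks, and no constant columns.

```python
import math


def column_stats(data):
    """Return per-column means and population standard deviations."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
        for j in range(p)
    ]
    return means, stds


def cross_correlation(x_block, y_block):
    """Pearson correlation of every x_block column with every y_block column.

    Rows are objects (assumed identical and identically ordered in both
    blocks); columns are variables. The result has one row per x variable
    and one column per y variable.
    """
    if len(x_block) != len(y_block):
        raise ValueError("both blocks must describe the same objects")
    n = len(x_block)
    mean_x, std_x = column_stats(x_block)
    mean_y, std_y = column_stats(y_block)
    if any(s == 0.0 for s in std_x + std_y):
        raise ValueError("constant columns have no defined correlation")
    return [
        [
            sum(
                (x_block[i][j] - mean_x[j]) * (y_block[i][k] - mean_y[k])
                for i in range(n)
            )
            / (n * std_x[j] * std_y[k])
            for k in range(len(mean_y))
        ]
        for j in range(len(mean_x))
    ]


# Toy data (hypothetical): four objects, two descriptors in one set
# and one descriptor in the other.
x_block = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y_block = [[2.0], [4.0], [6.0], [8.0]]
r_xy = cross_correlation(x_block, y_block)
```

For this toy data, `r_xy` is a 2 x 1 matrix whose entries are approximately 1.0 and 0.6: the first descriptor of `x_block` carries the same information as the single descriptor of `y_block`, while the second only partly overlaps with it.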