Similarity measures for multidimensional data

How similar are two data cubes? Put differently: given two sets of points in a multidimensional hierarchical space, what is the distance between them? In this paper we explore various distance functions that can be used over multidimensional hierarchical spaces. We organize the discussed functions with respect to the properties of the dimension hierarchies, levels and values. To discover which distance functions are more suitable and meaningful to users, we conducted two user studies. The first user study concerns the most preferred distance function between two values of a dimension. Its findings indicate that the functions that best fit user needs tend to consider as closest to a point in a multidimensional space the points with the smallest shortest path with respect to the same dimension hierarchy. The second user study aimed to discover which distance function between two data cubes is most preferred by users. The two functions that drew the attention of users were (a) the summation of distances between every cell of a cube and the most similar cell of another cube and (b) the Hausdorff distance function. Overall, users preferred the former function over the latter; however, the individual scores of the tests indicate that this advantage is rather narrow.
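
To make the two cube-level functions concrete, the following is a minimal sketch, not the paper's implementation. It assumes each dimension hierarchy is a tree given as a child-to-parent map, that the distance between two values is the number of edges on the shortest path between them, and that the distance between two cells is the sum of the per-dimension value distances. The toy hierarchies `GEO` and `TIME`, the sample cubes, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Cell = Tuple[str, ...]        # one value per dimension
Hierarchy = Dict[str, str]    # child -> parent; the root has no entry

# Hypothetical toy hierarchies, invented for illustration only.
GEO: Hierarchy = {"Athens": "Greece", "Ioannina": "Greece", "Rome": "Italy",
                  "Greece": "Europe", "Italy": "Europe", "Europe": "ALL"}
TIME: Hierarchy = {"2009-01": "2009", "2009-02": "2009", "2010-01": "2010",
                   "2009": "ALL", "2010": "ALL"}


def ancestors(value: str, parent: Hierarchy) -> List[str]:
    """Return the value followed by its ancestors, up to the root."""
    chain = [value]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def path_distance(a: str, b: str, parent: Hierarchy) -> int:
    """Edges on the shortest path between a and b in the hierarchy tree."""
    steps_from_b = {v: i for i, v in enumerate(ancestors(b, parent))}
    for steps_from_a, v in enumerate(ancestors(a, parent)):
        if v in steps_from_b:  # first common ancestor
            return steps_from_a + steps_from_b[v]
    raise ValueError("values do not share a root")


def cell_distance(x: Cell, y: Cell, hierarchies: Sequence[Hierarchy]) -> int:
    """Assumed cell distance: sum of per-dimension path distances."""
    return sum(path_distance(a, b, h) for a, b, h in zip(x, y, hierarchies))


def closest_sum(cube_a: Sequence[Cell], cube_b: Sequence[Cell],
                dist: Callable[[Cell, Cell], float]) -> float:
    """Function (a): sum, over cells of A, of the distance to the closest cell of B."""
    return sum(min(dist(x, y) for y in cube_b) for x in cube_a)


def hausdorff(cube_a: Sequence[Cell], cube_b: Sequence[Cell],
              dist: Callable[[Cell, Cell], float]) -> float:
    """Function (b): symmetric Hausdorff distance between two sets of cells."""
    def directed(p: Sequence[Cell], q: Sequence[Cell]) -> float:
        return max(min(dist(x, y) for y in q) for x in p)
    return max(directed(cube_a, cube_b), directed(cube_b, cube_a))


def dist(x: Cell, y: Cell) -> int:
    return cell_distance(x, y, [GEO, TIME])


cube_a = [("Athens", "2009-01"), ("Rome", "2009-02")]
cube_b = [("Ioannina", "2009-01"), ("Rome", "2010-01")]

print(closest_sum(cube_a, cube_b, dist))  # 6
print(hausdorff(cube_a, cube_b, dist))    # 4
```

Under these assumptions, function (a) accumulates the mismatch of every cell, while the Hausdorff distance reports only the worst-matched cell. Function (a) as written is directional (from the first cube to the second); a symmetric variant would need an explicit combination of both directions.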
