Rank-Based Similarity Index (RBSI) in a Multidimensional Data Set

When exploring a data set, we generally use a distance to evaluate the similarity or dissimilarity between data points. In a multidimensional space, the usual distances combine the values of the variables. This approach has two significant drawbacks. First, the variables generally have neither the same unit nor the same scale, so they must be standardized before a distance is computed. Second, some variables may be irrelevant for assessing the similarity between data points. This paper proposes a new similarity index based on data rankings, called the Rank-Based Similarity Index (RBSI). The goal is to use RBSI in place of the standard distances in order to avoid these drawbacks. RBSI is built in three steps. The first step defines a similarity function for each data point and each variable; each function is based on the rankings of the data. The second step computes the mean of the similarity values to define two characteristics for each variable, called sensitivity and specificity, which assess the relevance of that variable for evaluating similarity. The third step aggregates the values of the similarity functions with an ordered weighted averaging (OWA) operator [3] to define RBSI. The weights of the OWA operator integrate the relevant characteristics of the variables. Finally, we compare RBSI with the usual distances: RBSI gives better results in assessing the similarity between the data.
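
To make the two generic ingredients of the abstract concrete, the following is a minimal sketch in Python. It is not the RBSI construction itself, whose similarity functions and weights are not specified in this summary. The function names `ranks` and `owa`, the assumption of distinct values within a variable, and the example numbers are all hypothetical and chosen only for illustration. The sketch shows that ranks are unchanged when a variable is rescaled, and how an OWA operator aggregates a vector of per-variable similarity values after sorting them in decreasing order.

```python
from typing import List, Sequence


def ranks(column: Sequence[float]) -> List[int]:
    """Return 1-based ranks of the values in one variable.

    Assumption for this sketch: all values are distinct (no tie handling).
    """
    order = sorted(range(len(column)), key=lambda i: column[i])
    result = [0] * len(column)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted averaging of `values`.

    The values are sorted in decreasing order, and the j-th weight is
    applied to the j-th largest value. Weights must be non-negative
    and sum to 1.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


# Ranks do not depend on the unit or scale of the variable.
heights_m = [1.62, 1.90, 1.75]
heights_mm = [1000.0 * h for h in heights_m]
same_ranks = ranks(heights_m) == ranks(heights_mm)

# Hypothetical per-variable similarity values for one pair of data points.
similarities = [0.2, 0.9, 0.5]
largest = owa(similarities, [1.0, 0.0, 0.0])   # weight on the largest value only
smallest = owa(similarities, [0.0, 0.0, 1.0])  # weight on the smallest value only
average = owa(similarities, [1 / 3, 1 / 3, 1 / 3])
```

With all weight on the first position the OWA operator returns the maximum, with all weight on the last position it returns the minimum, and with equal weights it returns the arithmetic mean. Intermediate weight vectors interpolate between these behaviours, which is what allows the weights to reflect how much each ordered similarity value should count.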