Association Rules and Compositional Data Analysis: Implications to Big Data

Many modern organizations generate large amounts of transaction data on a daily basis. Transactions typically include semantic descriptors that require specialised methods for analysis. Association rule (AR) mining is a powerful semantic data analytic technique for extracting information from transaction databases; it indicates which items go with which other items in a set of transactions. AR mining was originally developed for market basket analysis, where the combinations of items in shopping baskets are evaluated to determine their prevalence and its impact on shelf layouts. To generate ARs, the collection of frequent itemsets (sets of two or more items) must first be detected. Then, as a second step, all possible ARs are generated from each itemset. The ARs are then ranked using measures of association called, in this context, “measures of interestingness”. The R package “arules” provides more than a dozen such measures, including the relative linkage disequilibrium (RLD), which normalises the classical Euclidean distance of the itemset from a surface of independence. In this work, we study AR and RLD from a compositional data (CoDa) perspective. It is well known that CoDa methodology provides desirable properties such as subcompositional coherence and scale invariance. We explore here the implications of CoDa for AR mining in big data analysis. The aim is to analyse whether CoDa properties ensure that AR characteristics are not scale dependent and that, if we consider a subset of the original items, we still observe similar behaviour. The work focuses on these aspects, including the dynamic visualization of CoDa-AR measures on a simplex representation of the itemsets and its multidimensional extension.
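
The two-step procedure described above (detect frequent itemsets, then generate and rank rules) can be illustrated with a minimal Python sketch. It is not the method of the “arules” package and does not compute RLD: the toy transactions, the function names, the brute-force enumeration and the support threshold are all assumptions made for illustration. It uses the standard measures support, confidence, lift and leverage, where leverage is the raw difference between the observed joint support and the value expected under independence.

```python
from itertools import combinations

# Hypothetical toy transactions (illustrative only, not from the source).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Step 1: brute-force search for itemsets of two or more items."""
    items = sorted(set().union(*transactions))
    found = {}
    for k in range(2, max_size + 1):
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                found[frozenset(combo)] = s
    return found

def rules_from_itemset(itemset, transactions):
    """Step 2: every split of the itemset into antecedent => consequent."""
    items = sorted(itemset)
    s_ab = support(itemset, transactions)
    rules = []
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            a = frozenset(antecedent)
            b = frozenset(itemset) - a
            s_a = support(a, transactions)
            s_b = support(b, transactions)
            rules.append({
                "antecedent": a,
                "consequent": b,
                "support": s_ab,
                "confidence": s_ab / s_a,
                "lift": s_ab / (s_a * s_b),
                # Deviation of the joint support from independence.
                "leverage": s_ab - s_a * s_b,
            })
    return rules

frequent = frequent_itemsets(transactions, min_support=0.4)
rules = [r for s in frequent for r in rules_from_itemset(s, transactions)]
# Step 3: rank the rules by one measure of interestingness (here, lift).
rules.sort(key=lambda r: r["lift"], reverse=True)
```

Brute-force enumeration is only workable for a handful of items; for transaction data at scale, a pruning algorithm such as Apriori would replace step 1.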