Covariance estimation for vertically partitioned data in a distributed environment

The sources of abundant data are constantly expanding as data collection methodologies spread across applications such as medicine, insurance, science, bioinformatics, and business. These data sets may be geographically distributed and large in both size and dimensionality. Analyzing them to uncover hidden patterns conventionally requires downloading the data to a centralized site, which is challenging given the limited bandwidth available and is also computationally expensive. The covariance matrix is one method for estimating the relationship between any two dimensions. In this paper, we propose a communication-efficient algorithm to estimate the covariance matrix in a distributed manner. The global covariance matrix is computed by merging the local covariance matrices using a distributed approach. The results show that the computed matrix is exactly the same as the one produced by the centralized method, with a good speed-up in computation. The speed-up comes from constructing the local covariances in parallel and distributing the cross-covariances among the nodes so that the load is balanced. The results are analyzed on the Mfeat data set under various partitions, which also addresses scalability.
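As an illustration of the merging idea only, the following is a minimal single-process sketch in Python, not the algorithm proposed in the paper. It assumes each node holds a block of columns for the same set of rows, that the sample covariance (divisor n - 1) is used, and that cross-covariance pairs are handed out round-robin. The function names (`center`, `cross_cov`, `assign_pairs`, `global_covariance`) and the round-robin rule are hypothetical choices made for this example.

```python
from itertools import combinations


def center(block):
    """Subtract each column's mean; `block` is a list of rows."""
    n = len(block)
    means = [sum(col) / n for col in zip(*block)]
    return [[x - m for x, m in zip(row, means)] for row in block]


def cross_cov(a, b):
    """Sample cross-covariance of two centered blocks sharing the same rows."""
    n = len(a)
    return [[sum(x * y for x, y in zip(col_a, col_b)) / (n - 1)
             for col_b in zip(*b)]
            for col_a in zip(*a)]


def assign_pairs(num_nodes):
    """Hypothetical round-robin assignment of cross-covariance pairs to nodes."""
    owners = {node: [] for node in range(num_nodes)}
    for k, pair in enumerate(combinations(range(num_nodes), 2)):
        owners[k % num_nodes].append(pair)
    return owners


def global_covariance(blocks):
    """Merge local covariances and cross-covariances into one global matrix."""
    centered = [center(b) for b in blocks]
    widths = [len(b[0]) for b in blocks]
    offsets = [sum(widths[:i]) for i in range(len(blocks))]
    dim = sum(widths)
    cov = [[0.0] * dim for _ in range(dim)]

    def place(i, j, tile):
        # Write the tile for blocks (i, j) and its transpose for (j, i).
        for r, row in enumerate(tile):
            for c, value in enumerate(row):
                cov[offsets[i] + r][offsets[j] + c] = value
                cov[offsets[j] + c][offsets[i] + r] = value

    # Diagonal tiles: each node's local covariance (parallel in a real system).
    for i in range(len(blocks)):
        place(i, i, cross_cov(centered[i], centered[i]))

    # Off-diagonal tiles: cross-covariances, grouped by the node assigned to them.
    for node, pairs in assign_pairs(len(blocks)).items():
        for i, j in pairs:
            place(i, j, cross_cov(centered[i], centered[j]))
    return cov


# Toy data: 4 rows, 3 columns, split vertically across two "nodes".
data = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 5.0, 2.5], [4.0, 3.0, 0.0]]
blocks = [[row[:2] for row in data], [row[2:] for row in data]]
distributed = global_covariance(blocks)
centralized = cross_cov(center(data), center(data))
```

On this toy input, `distributed` and `centralized` can be compared entry by entry to check that the merged matrix matches the centralized computation. In an actual deployment the loops over nodes would run on separate machines, and only the column blocks needed for each assigned pair would be exchanged.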