A parallel algorithm for subset selection

Before analyzing a large data set, it is often desirable to work with only a subset of the data. Current subset selection methods choose points at random, which can lead to poor solutions. The selection method described in this paper uses the Effective Independence Distribution (EID) method, which chooses the observations that optimize the determinant of the information matrix. Because the method requires repeated computation of three matrix multiplications and a matrix inverse, it is computationally intensive for extremely large data sets. A recursive form of the EID that is suitable for parallelization is developed here. The parallel method is described in detail, and load balancing and communication issues are addressed. Implementation results on the Intel Paragon show that this is an effective parallel algorithm.
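
To make the cost of the direct (non-recursive) computation concrete, the following is a minimal serial sketch of one common way such a selection can be written. It is not the recursive or parallel form developed in the paper. It assumes the EID value of an observation x is the quantity x^T (X^T X)^{-1} x, where X^T X plays the role of the information matrix, and that the observation with the smallest value is discarded at each step. The function names `invert`, `eid_values`, and `select_subset` are hypothetical and chosen only for illustration.

```python
def invert(M):
    """Gauss-Jordan inverse (partial pivoting) of a small square matrix."""
    n = len(M)
    # Augment M with the identity matrix: [M | I].
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("information matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def eid_values(rows):
    """Assumed EID value of each observation x: x^T (X^T X)^{-1} x."""
    p = len(rows[0])
    # Information matrix X^T X, recomputed from scratch on every call.
    info = [[sum(x[i] * x[j] for x in rows) for j in range(p)]
            for i in range(p)]
    inv = invert(info)
    return [sum(x[i] * inv[i][j] * x[j] for i in range(p) for j in range(p))
            for x in rows]


def select_subset(X, k):
    """Drop the lowest-EID observation until k remain; return kept indices.

    Assumes k is at least the number of columns and X has full column rank.
    """
    keep = list(range(len(X)))
    while len(keep) > k:
        e = eid_values([X[i] for i in keep])
        worst = min(range(len(keep)), key=e.__getitem__)
        del keep[worst]
    return keep
```

Each pass through the loop rebuilds and inverts the information matrix for all remaining observations, which is the repeated work that makes the direct form expensive for very large data sets and that a recursive update is intended to avoid.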