Optimal subset selection methods

In many situations, the analyst must choose a subset of the data prior to processing it. Examples arise in experimental design, robust estimation of multivariate location and scatter, and density estimation. The problem can be stated as follows: given a data set of size $n$, select a subset of $h$ of these points, where $h < n$, according to a suitable selection criterion. In this dissertation, I develop several methods for selecting subsets of data and provide supporting theory for the algorithms. These methods optimize the determinant, the trace, or a single eigenvalue of the Fisher information matrix, and they are presented from two standpoints: when the data are centered at the sample mean and when they are not. I prove that the globally optimal solution is found when the trace or a single eigenvalue is optimized and the data need not be re-centered at each iteration.

Some of these methods require a large amount of computer time when extremely large data sets must be reduced. I therefore develop a recursive form for optimizing the determinant of the information matrix that can be used for centered or offset data. I show that this recursion makes the method suitable for parallel implementation, and I analyze the parallel code as implemented on the Intel Paragon. Results show that the method, while not embarrassingly parallel, yields excellent speed-up and efficiency. A study of load balance and communication costs is also provided.

These methods are applied to two statistical problems: selecting the subset of data for the Minimum Volume Ellipsoid estimate of location and scatter, and determining the number of groups and the initial parameters in finite mixture densities. The results indicate that the subset selection methods described in this dissertation are very useful in both cases. Finally, I discuss other statistical problems to which these algorithms can be applied.
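The determinant criterion can be illustrated with a simple greedy heuristic: seed a nonsingular information matrix, then repeatedly add the point that most increases $\det(X_S^\top X_S)$, exploiting the rank-one identity $\det(M + xx^\top) = \det(M)\,(1 + x^\top M^{-1} x)$ so each candidate is scored cheaply. This is only a sketch of the general idea under stated assumptions, not the dissertation's algorithm; the function name, the largest-norm seeding rule, and the uncentered-data setting are illustrative choices.

```python
import numpy as np

def greedy_d_optimal(X, h):
    """Greedily pick h of the n rows of X to (heuristically) maximize
    det(X_S^T X_S), the determinant of the information matrix of the
    subset.  A sketch only; no global optimality is claimed."""
    n, p = X.shape
    assert p <= h < n
    # Seed with the p rows of largest norm so the information matrix
    # starts out nonsingular (a simple heuristic choice).
    order = np.argsort(-np.linalg.norm(X, axis=1))
    chosen = list(order[:p])
    remaining = set(range(n)) - set(chosen)
    M = X[chosen].T @ X[chosen]
    Minv = np.linalg.inv(M)
    while len(chosen) < h:
        # det(M + x x^T) = det(M) * (1 + x^T M^{-1} x), so the best
        # point to add is the one with the largest leverage x^T M^{-1} x.
        best = max(remaining, key=lambda i: X[i] @ Minv @ X[i])
        x = X[best]
        # Sherman-Morrison rank-one update of M^{-1} after adding x.
        Mx = Minv @ x
        Minv -= np.outer(Mx, Mx) / (1.0 + x @ Mx)
        chosen.append(best)
        remaining.discard(best)
    return np.array(chosen)
```

The rank-one update is what makes a recursive, and hence parallelizable, formulation natural: each candidate point is scored independently of the others, so the leverage computations can be distributed across processors.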