Paper information - Selective mapping: a discrete optimization approach to selecting a population subset for use in a high-density genetic mapping project

Selective mapping: a discrete optimization approach to selecting a population subset for use in a high-density genetic mapping project

[…]@cs.cornell.edu. Department of Computer Science, Cornell University, Ithaca, NY 14853. Research supported by an NSF Graduate Research Fellowship, NSF grants CCR-970029 and DMS-9805602, ONR grant N0014-96-1-00500, and the UPS Foundation.
tv23@cornell.edu. USDA-ARS Center for Bioinformatics and Comparative Genomics, Cornell University, Ithaca, NY 14853. Research supported by NSF grant DBI-98-72617.
sdt4@cornell.edu. Department of Plant Breeding, Cornell University, Ithaca, NY 14853. Research supported by NSF grant DBI-98-72617.

We study the problem of sampling from a large genetic mapping population in which all individuals have identical pedigrees. We show that samples obtained from large populations, selected on the basis of limited genetic data, are better suited for use in high-density mapping experiments than random samples of the same size. We model the problem of choosing a mapping sample as a discrete stochastic optimization problem, related to existing clustering problems, and study various heuristics for the problem, including some randomized rounding algorithms. Experiments on both simulated data and ten data sets from biological populations show that these heuristics perform very well in practice despite the problem being NP-hard to approximate to within any constant. Our proposals offer the possibility of higher resolution, less expensive genetic maps.

1 Introduction

A genetic map is a statement about the linear ordering and relative positions of genetic loci in the genome of a particular organism. In the last two decades, extensive genetic maps have been developed for a wide variety of organisms. Such maps have many important applications in human and veterinary medicine, plant and animal breeding, and in various aspects of basic biological research [18]. Constructing a very high-density whole-genome map may require determining the location of tens of thousands of markers, cost millions of dollars, and take several years (e.g. [24]).

Current methodology calls for the use of a large randomly sampled mapping population. In this research, we propose a change in methodology that allows investigators to obtain a greater quantity of useful information for a given level of experimental effort. Our proposal is to divide the laboratory work into two phases. In the first phase, laboratory data are collected from a large random population for a fraction of the total number of markers. In the second phase, based on the solution to a discrete stochastic optimization problem whose input is derived from this first step, data for the remainder of the markers are collected on only a selected sample of the population.

In modeling the sample selection problem, we are first led to a deterministic discrete optimization problem, which is a non-metric variant of the k-center problem. This problem is not useful for real populations, since it requires a degree of experimental exactness which is not feasible, but we use it to create methods useful for a more realistic stochastic data model. In the deterministic problem, for each population member i, we represent its genome by the interval from 0 to A, where A is the length of the genome, and we are given a set C_i of discrete points in that interval which represent recombination sites. We seek the subset of size k from the n population members which minimizes the maximum length interval between consecutive points in the union of the C_i for the chosen population members. A simple 2-approximation algorithm is known for the metric k-center problem, while the non-metric k-center problem is NP-hard to approximate to within any constant [9, 8]. We show that despite the structure in this problem, it is also NP-hard to approximate within any constant factor.
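To make the deterministic objective concrete, the following Python sketch evaluates the largest gap between consecutive recombination sites for a chosen subset, and compares an exhaustive search with a simple greedy selection. It is an illustration under stated assumptions, not the heuristics studied in this paper: the function names (`max_gap`, `best_subset_exhaustive`, `greedy_subset`) are hypothetical, the genome endpoints 0 and A are assumed to count as boundary points when measuring gaps, and the greedy rule is a generic baseline with no approximation guarantee.

```python
from itertools import combinations


def max_gap(chosen, sites, genome_length):
    """Longest interval between consecutive points in the union of C_i.

    Assumption: the endpoints 0 and genome_length are treated as boundary
    points, so gaps at either end of the genome are also counted.
    """
    points = {0.0, float(genome_length)}
    for i in chosen:
        points.update(float(p) for p in sites[i])
    ordered = sorted(points)
    return max(b - a for a, b in zip(ordered, ordered[1:]))


def best_subset_exhaustive(sites, genome_length, k):
    """Try every size-k subset. Exponential time; only for tiny instances."""
    return min(
        combinations(range(len(sites)), k),
        key=lambda subset: max_gap(subset, sites, genome_length),
    )


def greedy_subset(sites, genome_length, k):
    """Hypothetical baseline: repeatedly add the member that most reduces
    the current maximum gap. Offers no guarantee of optimality."""
    chosen = []
    remaining = list(range(len(sites)))
    for _ in range(k):
        best = min(
            remaining,
            key=lambda i: max_gap(chosen + [i], sites, genome_length),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Toy instance: four population members on a genome of length 10.
    toy_sites = [[2, 8], [5], [1, 9], [4, 6]]
    subset = best_subset_exhaustive(toy_sites, 10.0, 2)
    print(subset, max_gap(subset, toy_sites, 10.0))
```

On this toy instance the exhaustive search and the greedy rule happen to agree, but in general only the exhaustive search is guaranteed to find the optimum, and its cost grows with the number of size-k subsets of the n population members.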
However, we have developed a large number of lower and upper bound heuristics, based on recent work in clustering problems of this sort [4], which yield demonstrably excellent results for this problem in practice. Based on our experience with the deterministic problem, we consider a discrete stochastic optimization problem which better models realistic biological data, and for which extensions of our heuristics also perform well. We also use linear programming in a novel way to derive a lower bound to the stochastic optimization problem. We believe that this new linear programming approach will have application in other contexts as well.

Culling mapping populations to a desired size has been done for some time due to equipment capacity