Sample Selection for Maximal Diversity

The problem of selecting a sample subset sufficient to preserve diversity arises in many applications. One example is the design of recombinant inbred lines (RILs) for genetic association studies. In this context, genetic diversity is measured by how many alleles are retained in the resulting inbred strains. RIL panels derived from more than two parental strains, such as the Collaborative Cross (Churchill et al., 2004), pose a particular challenge: which of the many existing lab mouse strains should be included in the initial breeding funnel in order to maximize allele retention? A similar problem occurs in the study of customer reviews when selecting a subset of products with maximal diversity in reviews. Diversity in this case implies the presence of a set of products having both positive and negative ranks for each customer. In this paper, we demonstrate that selecting an optimal diversity subset is an NP-complete problem via reduction to set cover. This reduction is sufficiently tight that greedy approximations to the set cover problem directly apply to maximizing diversity. We then suggest a slightly modified subset selection problem in which an initial greedy diversity solution is used to prune an exhaustive search for all diversity subsets whose coverage is bounded from below by a specified threshold. Extensive experiments on real datasets demonstrate the effectiveness and efficiency of our approach.
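The greedy idea mentioned in the abstract can be illustrated with a minimal sketch of the standard greedy heuristic for maximum k-coverage. This is not the paper's implementation: the function name `greedy_diverse_subset`, the representation of each strain as a set of (marker, allele) pairs, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_diverse_subset(
    alleles_by_strain: Dict[str, Set[Hashable]], k: int
) -> Tuple[List[str], Set[Hashable]]:
    """Pick up to k strains, each time adding the strain that
    contributes the most alleles not yet covered (hypothetical helper)."""
    chosen: List[str] = []
    covered: Set[Hashable] = set()
    remaining = dict(alleles_by_strain)  # shallow copy; the input is not modified
    for _ in range(k):
        if not remaining:
            break
        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining strain adds a new allele
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy data (invented): each strain carries a set of (marker, allele) pairs.
strains = {
    "A": {("snp1", "G"), ("snp2", "T")},
    "B": {("snp1", "G"), ("snp2", "C"), ("snp3", "A")},
    "C": {("snp1", "T"), ("snp3", "A")},
    "D": {("snp2", "T"), ("snp3", "G")},
}
chosen, covered = greedy_diverse_subset(strains, k=2)
print(chosen, len(covered))
```

On this toy input the sketch selects strains B and D, which together retain 5 of the 6 distinct alleles. A greedy choice of this kind is not guaranteed to be optimal in general; it is the usual approximation strategy for coverage problems.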

[1] Inderjit S. Dhillon, et al. Information-theoretic co-clustering, 2003, KDD '03.

[2] Terry Magnuson, et al. Genetic and haplotype diversity among wild-derived mouse inbred strains, 2004, Genome Research.

[3] Nengjun Yi, et al. The Collaborative Cross, a community resource for the genetic analysis of complex traits, 2004, Nature Genetics.

[4] Mike Steel, et al. Maximizing phylogenetic diversity in biodiversity conservation: Greedy solutions to the Noah's Ark problem, 2006, Systematic Biology.

[5] Huan Liu, et al. Feature Selection for Classification, 1997, Intell. Data Anal.

[6] Chunfang Jin, et al. Selective Phenotyping for Increased Efficiency in Genetic Mapping Studies, 2004, Genetics.

[7] Joel Parker, et al. Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows, 2007, ISMB/ECCB.

[8] Fei Zou, et al. Improving Quantitative Trait Loci Mapping Resolution in Experimental Crosses by the Use of Genotypically Selected Samples, 2005, Genetics.

[9] Lu Lu, et al. The genetic structure of recombinant inbred mice: high-resolution consensus maps for complex trait analysis, 2001, Genome Biology.

[10] Mike Steel, et al. Phylogenetic diversity and the greedy algorithm, 2005, Systematic Biology.

[11] Christos Faloutsos, et al. Fully automatic cross-associations, 2004, KDD.

[12] D. Hochbaum, et al. Analysis of the greedy approach in problems of maximum k-coverage, 1998.

[13] Ronald L. Rivest, et al. Introduction to Algorithms, Second Edition, 2001.

[14] Robert W. Williams, et al. Genetic dissection of complex and quantitative traits: from fantasy to reality via a community effort, 2002, Mammalian Genome.

[15] Kenneth Y. Goldberg, et al. Eigentaste: A Constant Time Collaborative Filtering Algorithm, 2001, Information Retrieval.