Variable selection in model-based clustering using multilocus genotype data

We propose a variable selection procedure for model-based clustering of multilocus genotype data. Some loci may carry no information for clustering individuals into statistically distinct populations, so we treat the inference of the number K of clusters and of the subset S of clustering-relevant loci as a model selection problem. The competing models are compared using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator $(\widehat{K}_n, \widehat{S}_n)$. An associated algorithm, Mixture Model for Genotype Data (MixMoGenD), has been implemented in C++ and is available at http://www.math.u-psud.fr/~toussile. To avoid an exhaustive search for the optimal model, we propose a modified backward-stepwise algorithm, which enables a better search for the optimal model among all possible cardinalities of S. We present numerical experiments on simulated and real datasets that illustrate the benefit of our loci selection procedure.
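
To make the search idea concrete, the following is a minimal Python sketch of a plain backward elimination that walks through every cardinality of S and keeps the best subset seen overall. It is not the modified algorithm implemented in MixMoGenD, whose details are not given here. The function name `backward_stepwise`, the `criterion` callable (standing in for a penalized negative log-likelihood of the model indexed by S, with K held fixed), and the toy numbers are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_stepwise(
    loci: Iterable[str],
    criterion: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy backward elimination over subsets of loci.

    `criterion` is a hypothetical placeholder for a penalized criterion
    (lower is better). The search does not stop at the first non-improving
    step: it visits one subset per cardinality and returns the best one.
    """
    current = frozenset(loci)
    best_set, best_score = current, criterion(current)
    while len(current) > 1:
        # Exclusion step: remove the locus whose removal scores lowest.
        score, dropped = min(
            (criterion(current - {locus}), locus) for locus in sorted(current)
        )
        current = current - {dropped}
        if score < best_score:
            best_set, best_score = current, score
    return best_set, best_score


# Toy criterion (assumed values): each locus contributes a fit gain,
# and every retained locus costs a penalty of 1.
gain = {"A": 5.0, "B": 0.1, "C": 4.0, "D": 0.2}


def toy_criterion(subset: FrozenSet[str]) -> float:
    return -sum(gain[locus] for locus in subset) + 1.0 * len(subset)


best_set, best_score = backward_stepwise(gain, toy_criterion)
print(sorted(best_set), best_score)
```

In this toy setting the two weakly informative loci are discarded because their fit gain does not offset the penalty. A full procedure would also vary K and refit the mixture model for each candidate subset, which this sketch leaves out.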
