Paper information - Variable selection in model-based clustering using multilocus genotype data

We propose a variable selection procedure for model-based clustering of multilocus genotype data, motivated by the fact that some loci may be irrelevant for clustering individuals into statistically different populations. Inferring the number K of clusters and the subset S of loci relevant for clustering is treated as a model selection problem, in which competing models are compared using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the resulting estimator $(\widehat{K}_n, \widehat{S}_n)$. An associated algorithm, Mixture Model for Genotype Data (MixMoGenD), has been implemented in C++ and is available at http://www.math.u-psud.fr/~toussile. To avoid an exhaustive search for the optimal model, we propose a modified Backward-Stepwise algorithm, which enables a better search among all possible cardinalities of S. We present numerical experiments on simulated and real datasets that highlight the benefit of our loci selection procedure.
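
For orientation, a penalized maximum likelihood criterion of the kind described in the abstract can be sketched in the following generic form. The notation is an assumption introduced here for illustration and is not taken from the abstract: $\ell_n$ denotes the log-likelihood of the n observed genotypes, $\widehat{\theta}_{K,S}$ the maximum likelihood estimate within the model with K clusters and clustering subset S, and $\mathrm{pen}_n$ the penalty function.

$$(\widehat{K}_n, \widehat{S}_n) \in \arg\min_{(K,S)} \left\{ -\ell_n\big(\widehat{\theta}_{K,S}\big) + \mathrm{pen}_n(K,S) \right\}$$

As familiar examples on this scale, writing $D(K,S)$ for the number of free parameters of the model (again a hypothetical symbol), a BIC-type penalty is $\mathrm{pen}_n(K,S) = \tfrac{1}{2} D(K,S) \log n$ and an AIC-type penalty is $\mathrm{pen}_n(K,S) = D(K,S)$. The precise conditions on the penalty under which consistency holds are those stated in the paper itself and are not reproduced here.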

Elisabeth Gassiat | Wilson Toussile
