On the Identifiability of Parameters in the Population Stratification Problem: A Worst-Case Analysis

In the problem of population stratification, each data instance is generated by a finite mixture model with $K$ mixture components and $L$ observed variables. Each variable takes its value in a finite state space of cardinality $M$, and the variables are drawn independently within each mixture component. In this paper, we study the identifiability of the parameters of this model, i.e., whether the parameters of a mixture model can be uniquely recovered from its mixture distribution. We first define the notion of informative variables. We then prove that, in the worst-case regime, the parameters are identifiable if and only if the number of informative variables is greater than or equal to $2K - 1$. As a result, in the worst-case analysis of the identifiability of finite mixture models, the number of required informative variables is $\Theta(K)$, independent of the state space size $M$.
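
To make the model concrete, the following minimal sketch computes the mixture distribution over $L$ variables from a set of mixture parameters, assuming the within-component independence described above. The function name `mixture_distribution` and all numerical values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from itertools import product


def mixture_distribution(weights, components):
    """Joint pmf of L variables under a mixture of product distributions.

    weights[k]          : mixing weight of component k (weights sum to 1)
    components[k][l][m] : P(variable l = state m | component k)
    Returns a dict mapping each outcome tuple (x_1, ..., x_L) to its probability.
    """
    L = len(components[0])     # number of observed variables
    M = len(components[0][0])  # size of the state space
    pmf = {}
    for x in product(range(M), repeat=L):
        total = 0.0
        for w, comp in zip(weights, components):
            p = w
            for l, state in enumerate(x):
                p *= comp[l][state]  # variables are independent within a component
            total += p
        pmf[x] = total
    return pmf


# Hypothetical parameters: K = 2 components, L = 3 variables, M = 2 states.
weights = [0.3, 0.7]
components = [
    [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]],
    [[0.2, 0.8], [0.5, 0.5], [0.1, 0.9]],
]
pmf = mixture_distribution(weights, components)

# Relabeling the components yields the same observed distribution.
swapped = mixture_distribution(weights[::-1], components[::-1])
```

Because `pmf` and `swapped` coincide, the sketch also illustrates that parameters can be recovered from the mixture distribution at most up to a permutation of the component labels.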
