Information Theory of Mixed Population Genome-Wide Association Studies

A Genome-Wide Association Study (GWAS) addresses the problem of associating subsequences of individuals' genomes with observable characteristics called phenotypes. In a genome of length G, each characteristic is observed to depend only on a specific subsequence of length L, called the causal subsequence. The objective is to recover the causal subsequence from a dataset of N individuals' genomes and their observed characteristics. Recently, the problem has been investigated from an information-theoretic point of view in [1], where the capacity of the problem is characterized and reliable learning of the causal subsequence is shown to exhibit a threshold effect at $\displaystyle \frac {Gh(L/G)}{N}$, where h denotes the binary entropy function. However, [1] assumes that the dataset is collected from a single population and does not consider mixed-population datasets, which arise in many practical settings. In this paper, we study the mixed-population version of GWAS, in which the dataset is gathered from K subpopulations rather than one. Each subpopulation has its own causal subsequence for the observed characteristic, and the subpopulation origin of each individual is latent. The objective is to recover all of the causal subsequences with high accuracy. We investigate the fundamental limits of mixed-population GWAS and characterize its capacity. We observe that, for a special class of two subpopulations, the capacity is one-fourth of the capacity of the unmixed-population case with the same parameters. We also show that the capacity of this problem is connected to the capacity region of the Multiple Access Channel (MAC).
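As a numerical illustration of the quantity $\frac{Gh(L/G)}{N}$ named above, the following minimal sketch evaluates it for given G, L, and N. The function names `binary_entropy` and `threshold_quantity` are hypothetical, and the use of base-2 logarithms (entropy in bits) is an assumption, since the abstract does not state the logarithm base. The sketch only computes the quantity; it does not decide on which side of the threshold reliable learning is possible.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p), assumed here to be measured in bits.

    Uses the usual convention h(0) = h(1) = 0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def threshold_quantity(G: int, L: int, N: int) -> float:
    """Evaluate G * h(L / G) / N for genome length G, causal length L,
    and dataset size N (hypothetical helper, not taken from [1])."""
    if G <= 0 or N <= 0 or not 0 <= L <= G:
        raise ValueError("require G > 0, N > 0 and 0 <= L <= G")
    return G * binary_entropy(L / G) / N


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    print(threshold_quantity(G=1000, L=10, N=100))
```

For fixed G and L, the quantity scales as 1/N, so doubling the number of individuals halves it.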

[1] Kannan Ramchandran, et al. Fundamental limits of DNA storage systems. 2017 IEEE International Symposium on Information Theory (ISIT), 2017.

[2] D. Reich, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 2006.

[3] David Tse, et al. Reference-based DNA shotgun sequencing: Information theoretic limits. 2013 IEEE International Symposium on Information Theory, 2013.

[4] Yuhong Yang. Elements of Information Theory (2nd ed.). Thomas M. Cover and Joy A. Thomas. 2008.

[5] D. Clayton, et al. Genome-wide association study and meta-analysis finds over 40 loci affect risk of type 1 diabetes. Nature Genetics, 2009.

[6] Mark I McCarthy, et al. Genome-wide association studies in type 2 diabetes. Current diabetes reports, 2009.

[7] R. Eeles, et al. Genome-wide association studies in cancer. Human molecular genetics, 2008.

[8] P. Donnelly, et al. Inference of population structure using multilocus genotype data. Genetics, 2000.

[9] Mohammad Ali Maddah-Ali, et al. Genome-Wide Association Studies: Information Theoretic Limits of Reliable Learning. 2018 IEEE International Symposium on Information Theory (ISIT), 2018.

[10] M. Daly, et al. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 2005.

[11] Ilan Shomorony, et al. Biological Applications of Information Theory in Honor of Claude Shannon's Centennial, Part 2: Fundamental Limits of Genome Assembly Under an Adversarial Erasure Model. 2018.

[12] Seyed Abolfazl Motahari, et al. Statistical Association Mapping of Population-Structured Genetic Data. 2016.

[13] Kannan Ramchandran, et al. Optimal DNA shotgun sequencing: Noisy reads are as good as noiseless reads. 2013 IEEE International Symposium on Information Theory, 2013.

[14] David Tse, et al. Information Theory of DNA Shotgun Sequencing. IEEE Transactions on Information Theory, 2012.