Genome-Wide Association Studies: Information Theoretic Limits of Reliable Learning

In a Genome-Wide Association Study (GWAS), the objective is to associate subsequences of individuals' genomes with observable characteristics called phenotypes. The genome, which contains the biological information of an individual, can be represented by a sequence of length $G$. Many observable characteristics of individuals can be related to a subsequence of a given length $L$, called the causal subsequence. Environmental effects make the relation between the causal subsequence and the observable characteristics a stochastic function. Our objective in this paper is to detect the causal subsequence of a specific phenotype using a dataset of $N$ individuals and their observed characteristics. We introduce an abstract formulation of GWAS that allows us to investigate the problem from an information-theoretic perspective. In particular, as the parameters $N$, $G$, and $L$ grow, we observe a threshold effect in the quantity $\frac{Gh(L/G)}{N}$, where $h(\cdot)$ is the binary entropy function. This effect allows us to define the capacity of recovering the causal subsequence by taking $\frac{Gh(L/G)}{N}$ as the rate of the GWAS problem. We develop an achievable scheme and a matching converse for this problem, and thus characterize its capacity in two scenarios: the zero-error-rate and the $\epsilon$-error-rate.
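To make the rate $\frac{Gh(L/G)}{N}$ concrete, the following minimal sketch evaluates it numerically. It is an illustration only, not code from the paper: the function names `binary_entropy` and `gwas_rate` are hypothetical, and the entropy is assumed to be measured in bits (base-2 logarithm), which the abstract does not state. Under the further assumption that a causal subsequence corresponds to a choice of $L$ positions out of $G$, the numerator $Gh(L/G)$ is close to $\log_2 \binom{G}{L}$ for large $G$, so the rate can be read as roughly the number of bits needed to name the causal subsequence, per individual in the dataset.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits (base 2 is an assumption here)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0  # convention: 0 * log(0) = 0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def gwas_rate(G: int, L: int, N: int) -> float:
    """Rate G * h(L / G) / N as defined in the abstract (hypothetical helper)."""
    if not (0 <= L <= G and G > 0 and N > 0):
        raise ValueError("require 0 <= L <= G, G > 0, N > 0")
    return G * binary_entropy(L / G) / N


if __name__ == "__main__":
    G, L, N = 1000, 500, 100
    print(gwas_rate(G, L, N))  # 10.0, since h(1/2) = 1 bit

    # Assumed interpretation: counting the ways to pick L of G positions.
    # The ratio below approaches 1 as G grows with L / G held fixed.
    print(math.log2(math.comb(G, L)) / (G * binary_entropy(L / G)))
```

Holding $L/G$ fixed and increasing $N$ lowers the rate, which matches the intuition that more individuals make recovery easier; the capacity results in the paper determine the rates at which reliable recovery is possible.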
