Bases, Bits and Disease: Bases, bits and disease: a mathematical theory of human genetics

T he field of human genetics has recently become overwhelmed with the need for information management and analysis. Genome-wide association studies, transcriptional profiling and proteomics have all contributed greatly to the increased need for expertise from the information sciences. One significant challenge for human genetics is to identify those genetic, genomic, proteomic and environmental factors that increase or decrease susceptibility to disease. This is a difficult challenge due to the enormous volume of information that is often noisy and the complexity of the genotype-to-phenotype mapping relationship that arises from nonlinear phenomena such as epistasis or gene– gene interaction. It is increasingly clear that an analytical retooling to confront both the volume and complexity of the data is critical for moving the field forward. The parametric statistical paradigm has served us well over the years but now is the time to explore a wide range of different analytical methods for complex problem solving from a diversity of fields such as computer science, mathematics, etc. The paper by Dong et al on page 229 of this issue explores the use of information theory for the genetic analysis of complex human diseases. Information theory was launched as a formal discipline in 1948 with the publication of Shannon’s two papers on ‘A Mathematical Theory of Communication’. Shannon at that time worked for Bell Labs and was interested in mathematical methods for encoding information for transmission through electronic signals. The basic problem is to encode a message, transmit it as a signal, receive it and decode it with minimal noise such that the original message is not lost. It was in these seminal papers that Shannon introduced and defined entropy as a measure of uncertainty to help with maximizing the efficiency and accuracy of encoding, sending, receiving and decoding a message. Consider the following simple genetic example as a means for explaining entropy. Assume that 100 cases with a particular disease and 100 healthy controls were sampled from a population for an association study and a bi-allelic single-nucleotide polymorphism (SNP) in a particular candidate gene is genotyped. A fundamental question is whether any information about who is a case and who is a control can be obtained from knowledge about genotype. Let us assume that the AA and aa genotypes are each represented in 25 cases and 25 controls, while the Aa genotype is represented in 50 cases and 50 controls. With this set of data it is clear that no information about case– control status can be gained by looking at the genotypes and thus there is maximum uncertainty. With a binary outcome like case–control status entropy is simply [ p log2 p] [(1 p) log2(1 p)], where p is the probability of being a case (p) within a genotype and (1 p) is the probability of being a control within a genotype. Thus, for our simple example, the entropy for genotype AA would be [(0.5* 1)þ (0.5* 1)]1⁄41. Consider the example where the AA genotype is represented in 50 cases and 0 controls. In this case the entropy would be 0, indicating maximum knowledge about who is a case and who is a control. Another important concept is information, which is simply 1 entropy. In our first example, information is 0 (the minimum) and in our second example information1⁄41 (the maximum). Using the communications analogy, the case–control status (1s and 0s) of each individual in the data set forms a message that could be sent to a collaborator, for example. The 1s and 0s in the message could be coded as genotypes AA, Aa and aa, and transmitted via e-mail, for example, to the collaborator along with a genetic model that assigns highand low risk to each genotype. The collaborator could then decode each individual’s genotype and re-assign case–control status using the genetic risk model as a key. The entropy or uncertainty of these assignments could be used as a measure of the quality of the information encoding and the quality of the transmission, which might be susceptible to the introduction of noise from random events or systematic errors. Parameterization of genetic association studies in this manner is useful because there is a nice foundation of mathematical theory for information sciences and there are many useful data analysis tools that make use of entropy and other information-based measures. In addition, casting the problem as an information theory problem provides a parallax view that is often very useful when confronting a complex problem. The paper by Dong et al introduces the Entropy-Based SNP– SNP Interaction Method (ESNP2) for detecting and modeling epistasis in genetic association studies. This approach consists of three steps. First, the entropy of the data set (cases and controls) is computed as a baseline. Second, the information gained about case–control status from knowledge of the genotypes at each single SNP is estimated. Third, the information gained about case–control status from knowledge about genotypes gained by combining two SNPs as a Cartesian product is estimated. This final step provides a measure of the interaction information for any pair of SNPs and thus can be used to assess the presence of epistasis in a genetic association study. The concept of interaction information was described by McGill and has been rediscovered every few years since. Jakulin and Bratko provide the most recent comprehensive evaluation of information theory to detect, visualize and interpret nonadditive News and Commentary