Information compression exploits patterns of genome composition to discriminate populations and highlight regions of evolutionary interest

Background

Genomic information allows population relatedness to be inferred and selected genes to be identified. Single nucleotide polymorphism microarray (SNP-chip) data, a proxy for genome composition, contain patterns in allele order and proportion. These patterns can be quantified by compression efficiency (CE). In principle, the composition of an entire genome can be represented by a single CE number that quantifies allele representation and order.

Results

We applied a compression algorithm (DEFLATE) to genome-wide high-density SNP data from 4,155 human, 1,800 cattle, 1,222 sheep, 81 dog and 49 mouse samples. All human ethnic groups can be clustered by CE, and the clusters recover the phylogeography obtained from traditional fixation index (FST) analyses. CE analysis of the other mammals segregates samples by breed or species, and is sensitive to admixture and past effective population size. This clustering is a consequence of individual patterns such as runs of homozygosity. Intriguingly, a related approach can also be used to identify genomic loci that show population-specific CE segregation. A high-resolution CE ‘sliding window’ scan across the human genome, organised at the population level, revealed genes known to be under evolutionary pressure. These include SLC24A5 (European and Gujarati Indian skin pigmentation), HERC2 (European eye color), LCT (European and Maasai milk digestion) and EDAR (Asian hair thickness). We also identified a set of previously unidentified loci with high population-specific CE scores, including the chromatin remodeler SCMH1 in Africans and EDA2R in Asians. Closer inspection reveals that these prioritised genomic regions do not correspond to simple runs of homozygosity but rather to compositionally complex regions that are shared by many individuals of a given population. Unlike FST, CE analyses do not require ab initio population comparisons and are amenable to the hemizygous X chromosome.

Conclusions

We conclude with a discussion of the implications of CE for a complex systems science view of genome evolution. CE allows one to clearly visualise the evolution of individual genomes and populations through a formal, mathematically rigorous information space. Overall, CE makes a set of biological predictions, some of which are unique and await functional validation.
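
The two measurements described in the abstract, a whole-genome CE value and a CE ‘sliding window’ scan, can be sketched with Python's standard zlib module, which implements DEFLATE. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not give the CE formula, so CE is assumed here to be the fraction of bytes saved by compression, and the genotype encoding (one character per SNP: "0", "1" or "2"), the function names, and the window and step sizes are all hypothetical.

```python
import zlib


def compression_efficiency(genotypes: str, level: int = 9) -> float:
    """Return 1 - compressed_size / raw_size for a genotype string.

    ASSUMPTION: this formula and the one-character-per-SNP encoding are
    illustrative; the abstract does not specify either.
    """
    raw = genotypes.encode("ascii")
    if not raw:
        raise ValueError("genotype string is empty")
    # zlib.compress applies DEFLATE and wraps it in a small zlib header
    # and checksum, which slightly lowers CE for very short inputs.
    compressed = zlib.compress(raw, level)
    return 1.0 - len(compressed) / len(raw)


def sliding_window_ce(genotypes: str, window: int = 1000, step: int = 500):
    """Return (start_index, CE) pairs for overlapping windows of SNPs.

    ASSUMPTION: window and step are counted in SNPs and their default
    values are arbitrary placeholders.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, compression_efficiency(genotypes[start:start + window], level=9))
        for start in range(0, len(genotypes) - window + 1, step)
    ]


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Synthetic data only: a run of homozygosity versus shuffled genotypes.
    homozygous_run = "0" * 2000
    mixed = "".join(rng.choice("012") for _ in range(2000))
    print("CE, homozygous run:", compression_efficiency(homozygous_run))
    print("CE, mixed genotypes:", compression_efficiency(mixed))
    for start, ce in sliding_window_ce(homozygous_run + mixed):
        print(start, round(ce, 3))
```

On this synthetic input the long homozygous run compresses far better than the shuffled genotypes, which is the kind of individual-level pattern the abstract says drives CE clustering. A population-level scan would additionally need a rule for combining per-individual window scores, which the abstract does not describe.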
