Improving the throughput of the forward population genetic simulation environment simuPOP

Biological populations arise, develop and evolve under a series of well-studied laws and fairly regular mechanisms. Population genetics is a field of science that aims to study and model these laws, along with the genetic composition and diversity of populations of various species and forms of life. At best, population genetic models can help verify past events in a population and, eventually, reconstruct unknown population histories in light of multidisciplinary evidence. One example is research on the human population prehistory of Finland. Population simulations are a sub-branch of the rapidly developing field of bioinformatics and can be divided into two approaches: forward-in-time and backward-in-time (coalescent). Both methodologies enable in silico testing of how the genetic composition of individuals develops in a well-defined population. This thesis focuses on the forward-in-time approach. Several software packages exist today for forward population simulations, and simuPOP [http://simupop.sourceforge.net] is probably the most flexible of them. Because it can model the transmission of genomes and arbitrary individual information between generations, simuPOP has potential applications even beyond population genetics. However, simuPOP tends to use an enormous amount of random access memory when simulating large populations. This thesis introduces three approaches to improving the throughput of simuPOP: i) introducing scripting guidelines, ii) approximating a complex simulation using the built-in biallelic mode of simuPOP and iii) proposing changes to the source code of simuPOP that would improve throughput. An earlier simuPOP script designed to simulate past demographic events of Finnish population history is used as an example. A batch of 100 simulation runs is executed on three versions of that script: standard, modified and biallelic.
Compared with the standard version, the modified simulation script runs marginally faster. Although it doubles the user time of a single simulation run, the biallelic approximation method consumes three times less random access memory while remaining compatible from a population genetic point of view. This suggests that built-in support for the biallelic approximation could be a valuable addition to simuPOP. Evidently, simuPOP can be applied to very complex forward population simulations. The use of individual information fields enables the user to set up arbitrary simulation scenarios. Data structure changes at the source code level are likely to improve throughput even further. Besides introducing improvements and guidelines to the simulation workflow, this thesis is a standalone case study on the use and development of bioinformatics software. Furthermore, an independent development version of simuPOP, called simuPOP-rev, is established with the goal of implementing the source code changes suggested in this thesis.

ACM Computing Classification System (CCS): D.1 [Programming techniques], G.1.6…
