Interpretation and approximation tools for big, dense Markov chain transition matrices in population genetics

Abstract

Background: Markov chains are a common framework for individual-based, state- and time-discrete models in evolution. Though they played an important role in the development of basic population genetic theory, the analysis of more complex evolutionary scenarios typically involves approximation with other types of models. As the number of states increases, the big, dense transition matrices involved become increasingly unwieldy. However, advances in computational technology continue to reduce the challenges of "big data", thus giving new potential to state-rich Markov chains in theoretical population genetics.

Results: Using a population genetic model based on genotype frequencies as an example, we propose a set of methods to assist in the computation and interpretation of big, dense Markov chain transition matrices. With the help of network analysis, we demonstrate how they can be transformed into clear and easily interpretable graphs, providing a new perspective even on the classic case of a randomly mating, finite population with mutation. Moreover, we describe an algorithm that saves computer memory by substituting the original matrix with a sparse approximation while preserving its mathematically important properties, including a closely corresponding dominant (normalized) eigenvector. A global sensitivity analysis of the approximation results in our example shows that a size reduction of more than 90% is possible without significantly affecting the basic model results. Sample implementations of our methods are collected in the Python module mamoth.

Conclusion: Our methods help to make stochastic population genetic models involving big, dense transition matrices computationally feasible. Our visualization techniques provide new ways to explore such models and concisely present the results. Thus, our methods will contribute to establishing state-rich Markov chains as a valuable supplement to the diversity of population genetic models currently employed, providing interesting new details about evolution, e.g., under non-standard reproductive systems such as partial clonality.
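
To make the sparse-approximation idea in the abstract concrete, here is a minimal Python sketch. It is not the algorithm implemented in mamoth, whose details the abstract does not give. The toy model (a Wright-Fisher chain on allele counts with symmetric mutation, rather than the genotype-frequency model), the row-stochastic convention, the function names wright_fisher, sparsify_rows and stationary, and the parameter keep_mass are all assumptions chosen for illustration. The sketch zeroes the smallest entries of each row, renormalizes so that rows still sum to one, and then compares the dominant left eigenvectors (stationary distributions) of the dense and the sparsified matrix.

```python
import math
import numpy as np


def wright_fisher(N, u):
    """Row-stochastic transition matrix on allele counts 0..N (assumed toy model)."""
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        x = i / N
        p = x * (1 - u) + (1 - x) * u  # allele frequency after symmetric mutation
        for j in range(N + 1):
            P[i, j] = math.comb(N, j) * p**j * (1 - p) ** (N - j)
    return P


def sparsify_rows(P, keep_mass=0.999):
    """Keep the largest entries of each row until they cover keep_mass, then renormalize.

    Illustrative thresholding scheme only; an assumption, not the published algorithm.
    """
    P = np.asarray(P, dtype=float)
    Q = np.zeros_like(P)
    for i, row in enumerate(P):
        order = np.argsort(row)[::-1]                  # column indices, largest entry first
        cum = np.cumsum(row[order])
        k = int(np.searchsorted(cum, keep_mass)) + 1   # number of entries to keep
        keep = order[:k]
        Q[i, keep] = row[keep]
        Q[i] /= Q[i].sum()                             # rows must remain probability vectors
    return Q


def stationary(P, tol=1e-12, max_iter=100_000):
    """Dominant left eigenvector of a row-stochastic matrix by power iteration."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi


P = wright_fisher(N=50, u=0.01)
Q = sparsify_rows(P, keep_mass=0.999)
pi_P, pi_Q = stationary(P), stationary(Q)
tv = 0.5 * np.abs(pi_P - pi_Q).sum()  # total variation distance between the two eigenvectors

print("fraction of entries kept:", np.count_nonzero(Q) / np.count_nonzero(P))
print("total variation distance:", tv)
```

Q is stored as a dense array here for simplicity; in practice the memory saving would come from holding it in a sparse format. How far keep_mass can be lowered before the stationary distribution drifts noticeably depends on the model and would have to be checked case by case.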
