An overview of recent developments in genomics and associated statistical methods

The landscape of genomics has changed drastically in the last two decades. Increasingly inexpensive sequencing has shifted the primary focus from the acquisition of biological sequences to the study of biological function. Assays have been developed to study many intricacies of biological systems, and publicly available databases have given rise to integrative analyses that combine information from many sources to draw complex conclusions. Such research was the focus of the recent workshop at the Isaac Newton Institute, ‘High dimensional statistics in biology’, where many computational methods from modern genomics and related disciplines were presented and discussed. Drawing as far as possible on the material from these talks, we give an overview of modern genomics: from the essential assays that make data generation possible, to the statistical methods that yield meaningful inference. We point to current analytical challenges where novel methods, or novel applications of existing methods, are needed.
