Predicting and understanding the stability of G-quadruplexes

Motivation: G-quadruplexes are stable four-stranded guanine-rich structures that can form in DNA and RNA. They are an important component of human telomeres and play a role in the regulation of transcription and translation. The biological significance of a G-quadruplex is crucially linked with its thermodynamic stability. Hence the prediction of G-quadruplex stability is of vital interest.

Results: In this article, we present a novel Bayesian prediction framework based on Gaussian process regression to determine the thermodynamic stability of previously unmeasured G-quadruplexes from the sequence information alone. We benchmark our approach on a large G-quadruplex dataset and compare our method to alternative approaches. Furthermore, we propose an active learning procedure which can be used to iteratively acquire data in an optimal fashion. Lastly, we demonstrate the usefulness of our procedure on a genome-wide study of quadruplexes in the human genome.

Availability: A data table with the training sequences is available as supplementary material. Source code is available online at http://www.inference.phy.cam.ac.uk/os252/projects/quadruplexes

Contact: os252@cam.ac.uk; jlh29@cam.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
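As a rough illustration of the kind of pipeline the Results describe, the following minimal sketch fits a Gaussian process regressor to sequence-derived features and then selects the next sequence to measure. It is not the authors' implementation. The k-mer count features, the squared-exponential kernel, the hyperparameter values, and the maximum-predictive-variance selection rule are all assumptions chosen for brevity, and the sequences and stability values are arbitrary placeholders rather than measured data.

```python
import numpy as np
from itertools import product

def kmer_features(seq, k=2):
    """Count each k-mer over the alphabet ACGT (an assumed feature map)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    x = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing other symbols
            x[index[kmer]] += 1
    return x

def rbf_kernel(A, B, length_scale=3.0, signal_var=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(d2, 0.0) / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise_var=0.1):
    """Posterior mean and latent variance of a GP with Gaussian noise."""
    y_mean = y_train.mean()  # use the training mean as a constant prior mean
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mean = y_mean + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(X_test, X_test).diagonal() - np.sum(v**2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

# Hypothetical G-rich sequences; the target values are arbitrary placeholders,
# NOT measured stabilities.
train_seqs = ["GGGTTAGGGTTAGGGTTAGGG", "GGGAGGGAGGGAGGG",
              "GGGTGGGTGGGTGGG", "GGGCTAGGGCTAGGGCTAGGG"]
placeholder_stability = np.array([60.0, 75.0, 70.0, 55.0])

# Hypothetical unmeasured candidates for the next experiment.
candidates = ["GGGTTTGGGTTTGGGTTTGGG", "GGGAAGGGAAGGGAAGGG", "GGGCGGGCGGGCGGG"]

X_train = np.array([kmer_features(s) for s in train_seqs])
X_cand = np.array([kmer_features(s) for s in candidates])

mean, var = gp_predict(X_train, placeholder_stability, X_cand)

# Assumed active learning rule: measure the candidate the model is least sure of.
next_idx = int(np.argmax(var))
print("next sequence to measure:", candidates[next_idx])
```

The predictive variance gives each prediction an uncertainty estimate, and the same quantity drives the selection step. Choosing the candidate with the largest variance is one common heuristic; the selection criterion used in the article may differ.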
