Torus principal component analysis with applications to RNA structure

Several cutting-edge applications require PCA methods for data on tori, and we propose a novel torus-PCA method with important properties that can be applied generally. There are two existing general methods: tangent space PCA and geodesic PCA. Unlike tangent space PCA, our torus-PCA honors the cyclic topology of the data space; unlike geodesic PCA, it produces a variety of non-winding, non-dense descriptors. This is achieved by deforming tori into spheres and then using a variant of the recently developed principal nested spheres analysis. This analysis involves a step of small sphere fitting, and we provide an improved test to avoid overfitting. However, deforming tori into spheres creates singularities. We introduce a data-adaptive pre-clustering technique to keep the singularities away from the data. For the frequently encountered case that the residual variance around the main PCA component is small, we use a post-mode hunting technique for more fine-grained clustering. Thus, in general, torus-PCA in practice consists of three successive, interrelated key steps: pre-clustering, deformation, and post-mode hunting. We illustrate our method with two recently studied RNA structure (tori) data sets. The first is a small RNA data set, established as a benchmark for PCA, which we use to validate our method. The second is a large RNA data set (containing the small one), for which we show that our method provides interpretable principal components and gives further insight into the structure of the data.
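To make the point about cyclic topology concrete, the following is a minimal sketch (not the torus-PCA method itself) of why angular data cannot be treated as ordinary Euclidean coordinates. It compares a naive arithmetic mean of two angles with a mean direction computed on the circle. The function names `circular_mean` and `angular_distance` and the example angles are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def circular_mean(angles_deg):
    """Mean direction of angles given in degrees, via unit vectors on the circle."""
    radians = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Average the unit vectors, then read off the direction of the average.
    mean_angle = np.arctan2(np.sin(radians).mean(), np.cos(radians).mean())
    return float(np.rad2deg(mean_angle) % 360.0)

def angular_distance(a_deg, b_deg):
    """Shortest distance between two angles on the circle, in degrees."""
    diff = np.abs(np.asarray(a_deg, dtype=float) - np.asarray(b_deg, dtype=float)) % 360.0
    return np.minimum(diff, 360.0 - diff)

# Hypothetical dihedral angles lying on either side of the 0/360 degree cut.
angles = [350.0, 20.0]

naive = float(np.mean(angles))   # ignores the wrap-around, lands near 185 degrees
circ = circular_mean(angles)     # respects the wrap-around, lands near 5 degrees

print(f"naive mean: {naive:.1f}, circular mean: {circ:.1f}")
print(f"distance between the two angles: {float(angular_distance(*angles)):.1f}")
```

The two angles are only 30 degrees apart on the circle, yet the naive mean falls on the opposite side of it. Any method that linearizes the angles around a cut point, as a purely coordinate-based analysis would, inherits this kind of distortion for data near the cut.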
