Basic exploratory proteins analysis with statistical methods applied on structural features

Exploratory Data Analysis (EDA) is an approach for summarizing and visualizing the main characteristics of a data set. It provides a preliminary screening of the data and displays multivariate data graphically, making them easier to interpret. It can also reveal aspects of the data that simple evaluations leave hidden. EDA is particularly suitable for datasets with comparable variables, such as the structural and geometrical features of proteins. In this work, we analyzed a set of proteins belonging to ten different architectural families. After the retrieval, feature selection and normalization stages, the dataset was processed by means of simple correlation, partial correlation and principal component analysis (PCA), highlighting family-independent and family-specific relationships among features, as well as possible outliers in the dataset. The results can be useful for connecting these structural features to functional protein properties.
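The three analyses named above can be sketched in a few lines. The following is a minimal illustration in Python with NumPy, not the authors' pipeline: the function names (`zscore`, `simple_correlation`, `partial_correlation`, `pca`) are hypothetical, and the data are synthetic stand-ins for structural features, in which two columns are correlated only because both depend on a third.

```python
import numpy as np

def zscore(X):
    """Normalize each feature (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def simple_correlation(X):
    """Pearson correlation matrix between features (columns)."""
    return np.corrcoef(X, rowvar=False)

def partial_correlation(X):
    """Partial correlation of each feature pair, controlling for all
    other features, computed from the inverse covariance matrix."""
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def pca(X):
    """PCA on normalized features via singular value decomposition.
    Returns sample scores, feature loadings, explained variance ratios."""
    Z = zscore(X)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s
    explained = s**2 / np.sum(s**2)
    return scores, Vt.T, explained

# Synthetic data (an assumption for illustration only): features x and y
# are related solely through the shared feature z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = z + 0.3 * rng.normal(size=n)
y = z + 0.3 * rng.normal(size=n)
features = np.column_stack([x, y, z])

corr = simple_correlation(features)
pcor = partial_correlation(features)
scores, loadings, explained = pca(features)

print("simple correlation x-y: ", round(corr[0, 1], 3))
print("partial correlation x-y:", round(pcor[0, 1], 3))
print("explained variance ratios:", np.round(explained, 3))
```

In this toy case the simple correlation between `x` and `y` is high, while their partial correlation is close to zero once `z` is controlled for. This is the kind of distinction that comparing the two measures is meant to expose. Samples with extreme values in `scores` are candidates for the outlier inspection mentioned in the abstract.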
