Network-based protein structural classification

Experimental determination of protein function is resource-intensive. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help: it assigns currently unclassified proteins to structural classes based on their features, and then relies on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct three-dimensional (3D) structure-based protein features. By contrast, we first model the 3D structures of proteins as protein structure networks (PSNs) and then use network-based features for PSC. Specifically, we propose using graphlets, which are state-of-the-art features in many research areas of network science, for the task of PSC. Moreover, because graphlets can handle only unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from weighted PSNs. When evaluated on a large set of approximately 9400 CATH and approximately 12 800 SCOP protein domains (spanning 36 PSN sets), the best of our proposed approaches are more accurate than existing PSC approaches, with comparable running times. Our data and code are available at https://doi.org/10.5281/zenodo.3787922
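To make the idea of an unweighted PSN with graphlet features concrete, the following is a minimal sketch, not the authors' implementation. It assumes one 3D coordinate per residue (for example, the alpha-carbon position), links two residues when they lie within a distance cutoff (the 6.0 angstrom default is an illustrative assumption), and counts only the two 3-node graphlets (induced paths and triangles). The function names `build_psn` and `count_3node_graphlets` are hypothetical.

```python
from itertools import combinations
from math import dist


def build_psn(coords, cutoff=6.0):
    """Build an unweighted PSN as adjacency sets.

    coords: list of (x, y, z) tuples, one per residue (assumed input format).
    cutoff: distance threshold in angstroms (illustrative value, an assumption).
    """
    adj = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def count_3node_graphlets(adj):
    """Count induced 3-node paths and triangles in the network."""
    wedges = 0      # pairs of neighbors around each node, adjacent or not
    triangles = 0   # pairs of neighbors that are themselves adjacent
    for node, neighbors in adj.items():
        k = len(neighbors)
        wedges += k * (k - 1) // 2
        for a, b in combinations(neighbors, 2):
            if b in adj[a]:
                triangles += 1
    triangles //= 3  # each triangle was seen once from each of its 3 nodes
    # A triangle contains 3 wedges; the remaining wedges are induced paths.
    return {"path": wedges - 3 * triangles, "triangle": triangles}


# Toy example: four residues in a plane.
toy_coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (4, 3, 0)]
toy_psn = build_psn(toy_coords)
toy_counts = count_3node_graphlets(toy_psn)
```

A vector of such counts, extended to larger graphlets, is the kind of fixed-length feature a standard classifier can consume. Because the cutoff discards the actual distances, this representation cannot use edge weights, which is the limitation that motivates learning features from weighted PSNs.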
