An in silico approach to identification, categorization and prediction of nucleic acid binding proteins

Interactions between proteins and nucleic acids play an important role in many processes, such as transcription, translation, and DNA repair. Exploring the function of the proteins involved in these interactions helps explain the mechanisms of related biological events, such as viral infection, and can support the design of novel drug targets. The number of known protein sequences has increased rapidly in recent years, but the databases describing protein structure and function have grown much more slowly. Improving such databases is therefore valuable for predicting protein-nucleic acid interactions. The information for each sequence, including its function and interaction sites, was collected and identified, and a database called PNIDB was built. The proteins in PNIDB were grouped into 27 classes, such as transcription, immune system, and structural protein. The function of each protein was then predicted using a machine learning method: a classifier was trained on labeled sequences and then used to predict the function of a protein. The prediction accuracy reached 77.43% under 10-fold cross-validation.

Availability and Implementation: PNIDB is fully operational and can be freely accessed at http://server.malab.cn/PNIDB/index.html. All the data are publicly available for non-commercial use, distribution, and reproduction in any medium.

Contact: zouquan@nclab.net
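
The train-then-evaluate procedure mentioned in the abstract (fit a classifier on labeled sequences, then estimate accuracy by 10-fold cross-validation) can be illustrated with a minimal sketch. The abstract does not state which features or classifier were used, so everything below is an assumption made for illustration: amino acid composition as the feature vector, a nearest-centroid classifier, and randomly generated toy sequences. The function names (`composition`, `train_centroids`, `predict`, `cross_validate`) are hypothetical and are not part of PNIDB.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def train_centroids(features, labels):
    """Average the feature vectors of each class (nearest-centroid model)."""
    sums, sizes = {}, Counter(labels)
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    return {lab: [v / sizes[lab] for v in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def cross_validate(seqs, labels, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation."""
    feats = [composition(s) for s in seqs]
    idx = list(range(len(seqs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_centroids([feats[i] for i in train],
                                [labels[i] for i in train])
        correct += sum(predict(model, feats[i]) == labels[i] for i in fold)
    return correct / len(seqs)


# Toy data (assumption): random sequences drawn from two different alphabets,
# labeled with two of the class names mentioned in the abstract.
rng = random.Random(1)


def toy_sequence(letters, length=30):
    return "".join(rng.choice(letters) for _ in range(length))


seqs = [toy_sequence("KKRRAG") for _ in range(20)] + \
       [toy_sequence("LLVVAG") for _ in range(20)]
labels = ["transcription"] * 20 + ["structural protein"] * 20

accuracy = cross_validate(seqs, labels, k=10)
print(f"10-fold cross-validation accuracy on toy data: {accuracy:.2%}")
```

Each sequence is held out exactly once, so the reported accuracy is the fraction of held-out sequences classified correctly across all ten folds. The toy classes are deliberately easy to separate; the figure printed here says nothing about the 77.43% reported for PNIDB.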
