Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy

BackgroundIt is necessary and essential to discovery protein function from the novel primary sequences. Wet lab experimental procedures are not only time-consuming, but also costly, so predicting protein structure and function reliably based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA binding protein, which plays a key role in the transcription regulation. Our study proposed an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method would guide for the special protein identification with computational intelligence strategies.ResultsFirstly, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Secondly, hierarchical features dimensionality reduction strategies were employed to improve the performance furthermore. Currently, Pretata achieves 92.92% TATA-binding protein prediction accuracy, which is better than all other existing methods.ConclusionsThe experiments demonstrate that our method could greatly improve the prediction accuracy and speed, thus allowing large-scale NGS data prediction to be practical. A web server is developed to facilitate the other researchers, which can be accessed at http://server.malab.cn/preTata/.

[1]  Dong-Sheng Cao,et al.  protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences , 2015, Bioinform..

[2]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[3]  Hui Liu,et al.  Improving compound–protein interaction prediction by building up highly credible negative samples , 2015, Bioinform..

[4]  Q. Zou,et al.  A Global Characterization and Identification of Multifunctional Enzymes , 2012, PloS one.

[5]  K. Chou,et al.  PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. , 2008, Analytical biochemistry.

[6]  Q. Zou,et al.  Similarity computation strategies in the microRNA-disease network: a survey. , 2015, Briefings in functional genomics.

[7]  Shan Gao,et al.  Identification of properties important to protein aggregation using feature selection , 2013, BMC Bioinformatics.

[8]  X. Chen,et al.  SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence , 2003, Nucleic Acids Res..

[9]  R. Kornberg The molecular basis of eukaryotic transcription , 2007, Proceedings of the National Academy of Sciences.

[10]  Liang Liang,et al.  CytoSVM: an advanced server for identification of cytokine-receptor interactions , 2007, Nucleic Acids Res..

[11]  Shuigeng Zhou,et al.  A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation , 2009, Bioinform..

[12]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Yanjun Liu,et al.  Computational power of tissue P systems for generating control languages , 2014, Inf. Sci..

[14]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  B. Liu,et al.  iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition , 2014, PloS one.

[16]  Suxian Cai,et al.  Adaptive Linear and Normalized Combination of Radial Basis Function Networks for Function Approximation and Regression , 2014 .

[17]  Sridhar Krishnan,et al.  Effective Dysphonia Detection Using Feature Dimension Reduction and Kernel Density Estimation for Patients with Parkinson’s Disease , 2014, PloS one.

[18]  Xing Gao,et al.  Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique , 2015, IEEE Transactions on NanoBioscience.

[19]  B. Liu,et al.  An Approach for Identifying Cytokines Based on a Novel Ensemble Classifier , 2013, BioMed research international.

[20]  Xiangxiang Zeng,et al.  Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks , 2016, Briefings Bioinform..

[21]  Junjie Chen,et al.  Application of learning to rank to protein remote homology detection , 2015, Bioinform..

[22]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[23]  Quan Zou,et al.  Exploratory Predicting Protein Folding Model with Random Forest and Hybrid Features , 2014 .

[24]  Ying Gao,et al.  Bioinformatics Applications Note Sequence Analysis Cd-hit Suite: a Web Server for Clustering and Comparing Biological Sequences , 2022 .

[25]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[26]  R. Ji,et al.  Improved and Promising Identification of Human MicroRNAs by Incorporating a High-Quality Negative Set , 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[27]  Ying Ju,et al.  Improving tRNAscan‐SE Annotation Results via Ensemble Classifiers , 2015, Molecular informatics.

[28]  B. Liu,et al.  Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. , 2015, Journal of theoretical biology.

[29]  Y. Z. Chen,et al.  Prediction of transporter family from protein sequence by support vector machine approach , 2005, Proteins.

[30]  Bruschi,et al.  Classification of , 2010 .

[31]  Q. Zou,et al.  Representation of fluctuation features in pathological knee joint vibroarthrographic signals using kernel density modeling method. , 2014, Medical engineering & physics.

[32]  Liujuan Cao,et al.  A novel features ranking metric with application to scalable visual and bioinformatics data classification , 2016, Neurocomputing.

[33]  Ning Xiang,et al.  Classification of Knee Joint Vibration Signals Using Bivariate Feature Distribution Estimation and Maximal Posterior Probability Decision Criterion , 2013, Entropy.

[34]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[35]  Xiaolong Wang,et al.  Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation , 2015, BMC Systems Biology.

[36]  Xiangxiang Zeng,et al.  nDNA-prot: identification of DNA-binding proteins based on unbalanced classification , 2014, BMC Bioinformatics.

[37]  Xin Wang,et al.  PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. , 2012, Analytical biochemistry.

[38]  N. Altman An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression , 1992 .

[39]  Yi Jiang,et al.  BinMemPredict: a Web Server and Software for Predicting Membrane Protein Types , 2013 .

[40]  Xing Gao,et al.  An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information , 2015, IEEE Transactions on NanoBioscience.

[41]  Laurens van der Maaten,et al.  Accelerating t-SNE using tree-based algorithms , 2014, J. Mach. Learn. Res..

[42]  B. Liu,et al.  Recombination spot identification Based on gapped k-mers , 2016, Scientific Reports.

[43]  C Y Wang,et al.  imDC: an ensemble learning method for imbalanced classification with miRNA data. , 2015, Genetics and molecular research : GMR.

[44]  Shenghai Zhang,et al.  A GEE approach for estimating size of hard-to-reach population by using capture–recapture data , 2012 .

[45]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[46]  Junjie Chen,et al.  iMiRNA-SSF: Improving the Identification of MicroRNA Precursors by Combining Negative Sets with Different Distributions , 2016, Scientific Reports.

[47]  Sridhar Krishnan,et al.  Combining least-squares support vector machines for classification of biomedical signals: a case study with knee-joint vibroarthrographic signals , 2011, J. Exp. Theor. Artif. Intell..

[48]  Xiangxiang Zeng,et al.  Spiking Neural P Systems with Thresholds , 2014, Neural Computation.

[49]  Xiaolong Wang,et al.  Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection , 2013, Bioinform..