iTIS-PseKNC: Identification of Translation Initiation Site in human genes using pseudo k-tuple nucleotides composition

Translation is an essential genetic process for understanding the mechanism of gene expression. Due to the large number of protein sequences generated in the post-genomic era, conventional methods are unable to identify Translation Initiation Site (TIS) in human genes timely and accurately. It is thus highly desirable to develop an automatic and accurate computational model for identification of TIS. Considerable improvements have been achieved in developing computational models; however, development of accurate and reliable automated systems for TIS identification in human genes is still a challenging task. In this connection, we propose iTIS-PseKNC, a novel protocol for identification of TIS. Three protein sequence representation methods including dinucleotide composition, pseudo-dinucleotide composition and Trinucleotide composition have been used in order to extract numerical descriptors. Support Vector Machine (SVM), K-nearest neighbor and Probabilistic Neural Network are assessed for their performance using the constructed descriptors. The proposed model iTIS-PseKNC has achieved 99.40% accuracy using jackknife test. The experimental results validated the superior performance of iTIS-PseKNC over the existing methods reported in the literature. It is highly anticipated that the iTIS-PseKNC predictor will be useful for basic research studies.

[1]  K. Chou,et al.  iNitro-Tyr: Prediction of Nitrotyrosine Sites in Proteins with General Pseudo Amino Acid Composition , 2014, PloS one.

[2]  Gunnar Rätsch,et al.  Engineering Support Vector Machine Kerneis That Recognize Translation Initialion Sites , 2000, German Conference on Bioinformatics.

[3]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[4]  Wei Chen,et al.  iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition , 2014, Bioinform..

[5]  Xiaolong Wang,et al.  iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach , 2016, Journal of biomolecular structure & dynamics.

[6]  James G. Lyons,et al.  Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou׳s general PseAAC. , 2015, Journal of theoretical biology.

[7]  K. Chou,et al.  iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. , 2013, Analytical biochemistry.

[8]  Pengmian Feng,et al.  Prediction of DNase I Hypersensitive Sites by Using Pseudo Nucleotide Compositions , 2014, TheScientificWorldJournal.

[9]  Muhammad Tahir,et al.  PSOFuzzySVM-TMH: identification of transmembrane helix segments using ensemble feature space by incorporated fuzzy support vector machine. , 2015, Molecular bioSystems.

[10]  B. Liu,et al.  iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition , 2014, PloS one.

[11]  W. Zhong,et al.  Molecular Science for Drug Development and Biomedicine , 2014, International journal of molecular sciences.

[12]  Weidong Xiao,et al.  Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM , 2014, BMC Bioinformatics.

[13]  Asifullah Khan,et al.  WRF-TMH: predicting transmembrane helix by fusing composition index and physicochemical properties of amino acids , 2013, Amino Acids.

[14]  Xiao-Jun Zeng,et al.  Kernel regression networks with local structural information and covariance volume adaptation , 2008, Neurocomputing.

[15]  Wei Chen,et al.  iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition , 2013, Nucleic acids research.

[16]  K. Chou,et al.  iSS-PseDNC: Identifying Splicing Sites Using Pseudo Dinucleotide Composition , 2014, BioMed research international.

[17]  K. Chou,et al.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. , 2014, Analytical biochemistry.

[18]  Tao Jiang,et al.  A Class of Edit Kernels for SVMs to Predict Translation Initiation Sites in Eukaryotic mRNAs , 2005, J. Comput. Biol..

[19]  Asifullah Khan,et al.  Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition. , 2011, Journal of theoretical biology.

[20]  R. Jackson,et al.  The mechanism of eukaryotic translation initiation and principles of its regulation , 2010, Nature Reviews Molecular Cell Biology.

[21]  Anders Gorm Pedersen,et al.  Neural Network Prediction of Translation Initiation Sites in Eukaryotes: Perspectives for EST and Genome Analysis , 1997, ISMB.

[22]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[23]  Donald F. Specht,et al.  Probabilistic neural networks , 1990, Neural Networks.

[24]  K. Chou,et al.  iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. , 2012, Molecular bioSystems.

[25]  Maike Tech,et al.  An unsupervised classification scheme for improving predictions of prokaryotic TIS , 2006, BMC Bioinformatics.

[26]  K. Chou,et al.  Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. , 2015, Molecular bioSystems.

[27]  Wei Chen,et al.  iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties , 2012, PloS one.

[28]  Tetsuo Nishikawa,et al.  Assessing protein coding region integrity in cDNA sequencing projects , 1998, Bioinform..

[29]  K. Chou,et al.  iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. , 2013, Analytical biochemistry.

[30]  Jianding Qiu,et al.  Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform. , 2009, Analytical biochemistry.

[31]  Saeed Ahmad,et al.  Identification of Heat Shock Protein families and J-protein types by incorporating Dipeptide Composition into Chou's general PseAAC , 2015, Comput. Methods Programs Biomed..

[32]  Kuo-Chen Chou,et al.  Some remarks on predicting multi-label attributes in molecular biosystems. , 2013, Molecular bioSystems.

[33]  Nicos G. Pavlidis,et al.  Optimizing the Performance of Probabilistic Neural Networks in a BioinformaticsTa sk , 2004 .

[34]  Dong-Sheng Cao,et al.  propy: a tool to generate various modes of Chou's PseAAC , 2013, Bioinform..

[35]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[36]  K. Chou,et al.  iRSpot-TNCPseAAC: Identify Recombination Spots with Trinucleotide Composition and Pseudo Amino Acid Components , 2014, International journal of molecular sciences.

[37]  Sukanta Mondal,et al.  Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. , 2014, Journal of theoretical biology.

[38]  Wei Chen,et al.  iORI-PseKNC: A predictor for identifying origin of replication with pseudo k-tuple nucleotide composition , 2015 .

[39]  Xiaolong Wang,et al.  Identification of DNA-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach , 2015, Journal of biomolecular structure & dynamics.

[40]  S. O’Brien,et al.  Evaluation and Integration of Genetic Signature for Prediction Risk of Nasopharyngeal Carcinoma in Southern China , 2014, BioMed research international.

[41]  S. Khan,et al.  Prediction of protein structure classes using hybrid space of multi-profile Bayes and bi-gram probability feature spaces. , 2014, Journal of theoretical biology.

[42]  Hao Lin,et al.  Prediction of Subcellular Localization of Apoptosis Protein Using Chou’s Pseudo Amino Acid Composition , 2009, Acta biotheoretica.

[43]  Xiaolong Wang,et al.  repRNA: a web server for generating various feature vectors of RNA sequences , 2015, Molecular Genetics and Genomics.

[44]  K. Chou,et al.  iSNO-PseAAC: Predict Cysteine S-Nitrosylation Sites in Proteins by Incorporating Position Specific Amino Acid Propensity into Pseudo Amino Acid Composition , 2013, PloS one.

[45]  Kuo-Chen Chou,et al.  iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. , 2015, Journal of theoretical biology.

[46]  Hong-Bin Shen,et al.  TargetFreeze: Identifying Antifreeze Proteins via a Combination of Weights using Sequence Evolutionary Information and Pseudo Amino Acid Composition , 2015, The Journal of Membrane Biology.

[47]  Maqsood Hayat,et al.  Discriminating protein structure classes by incorporating Pseudo Average Chemical Shift to Chou's general PseAAC and Support Vector Machine , 2014, Comput. Methods Programs Biomed..

[48]  Wei Chen,et al.  iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. , 2014, Analytical biochemistry.

[49]  Qiuwen Zhang,et al.  MultiP-SChlo: Multi-label protein subchloroplast localization prediction , 2014, 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[50]  K. Chou,et al.  iCTX-Type: A Sequence-Based Predictor for Identifying the Types of Conotoxins in Targeting Ion Channels , 2014, BioMed research international.

[51]  Wei Chen,et al.  iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition , 2014, Nucleic acids research.

[52]  Xiaolong Wang,et al.  repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects , 2015, Bioinform..

[53]  Pufeng Du,et al.  PseAAC-General: Fast Building Various Modes of General Form of Chou’s Pseudo-Amino Acid Composition for Large-Scale Protein Datasets , 2014, International journal of molecular sciences.

[54]  N. Raja,et al.  ANN Approach for Weather Prediction using Back Propagation , 2012 .

[55]  K. Chou,et al.  iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins , 2011, PloS one.

[56]  K. Chou,et al.  iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins , 2013, PeerJ.

[57]  K. Chou,et al.  iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. , 2015, Analytical biochemistry.

[58]  Yvan Saeys,et al.  Translation initiation site prediction on a genomic scale: beauty in simplicity , 2007, ISMB/ECCB.

[59]  K. Chou,et al.  iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model , 2015, Journal of biomolecular structure & dynamics.

[60]  M. A. Jordan,et al.  Acidophilic bacteria : Their potential mining and environmental applications , 1996 .

[61]  Xin Wang,et al.  PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. , 2012, Analytical biochemistry.

[62]  Artemis G. Hatzigeorgiou,et al.  Translation initiation start prediction in human cDNAs with high accuracy , 2002, Bioinform..

[63]  S. Paul Sathiyan,et al.  Neural Network based ACC for Optimized safety and comfort , 2012 .

[64]  K. Chou Using subsite coupling to predict signal peptides. , 2001, Protein engineering.

[65]  Manish Kumar,et al.  Prediction of β-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine. , 2015, Journal of theoretical biology.

[66]  Zaheer Ullah Khan,et al.  Discrimination of acidic and alkaline enzyme using Chou's pseudo amino acid composition in conjunction with probabilistic neural network model. , 2015, Journal of theoretical biology.

[67]  Yonghong Wang,et al.  Recognition of Translation Initiation Sites of Eukaryotic Genes Based on an EM Algorithm , 2003, J. Comput. Biol..

[68]  B. Liu,et al.  Identification of Real MicroRNA Precursors with a Pseudo Structure Status Composition Approach , 2015, PloS one.

[69]  Maqsood Hayat,et al.  Discriminating outer membrane proteins with Fuzzy K-nearest Neighbor algorithms based on the general form of Chou's PseAAC. , 2012, Protein and peptide letters.

[70]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[71]  Wei Chen,et al.  PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions , 2015, Bioinform..