BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches

With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems is how to computationally analyze their structures and functions. Machine learning techniques are playing key roles in this field. Typically, predictors based on machine learning techniques contain three main steps: feature extraction, predictor construction and performance evaluation. Although several Web servers and stand-alone tools have been developed to facilitate the biological sequence analysis, they only focus on individual step. In this regard, in this study a powerful Web server called BioSeq-Analysis (http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/) has been proposed to automatically complete the three main steps for constructing a predictor. The user only needs to upload the benchmark data set. BioSeq-Analysis can generate the optimized predictor based on the benchmark data set, and the performance measures can be reported as well. Furthermore, to maximize user's convenience, its stand-alone program was also released, which can be downloaded from http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/download/, and can be directly run on Windows, Linux and UNIX. Applied to three sequence analysis tasks, experimental results showed that the predictors generated by BioSeq-Analysis even outperformed some state-of-the-art methods. It is anticipated that BioSeq-Analysis will become a useful tool for biological sequence analysis.

[1]  Roger W. Johnson,et al.  An Introduction to the Bootstrap , 2001 .

[2]  B. Liu,et al.  Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. , 2015, Journal of theoretical biology.

[3]  Vasant Honavar,et al.  Predicting flexible length linear B-cell epitopes. , 2008, Computational systems bioinformatics. Computational Systems Bioinformatics Conference.

[4]  Ren Long,et al.  iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition , 2016, Bioinform..

[5]  Xiang Chen,et al.  The use of classification trees for bioinformatics , 2011, WIREs Data Mining Knowl. Discov..

[6]  Xiaolong Wang,et al.  A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis , 2008, BMC Bioinformatics.

[7]  Minoru Kanehisa,et al.  AAindex: Amino Acid index database , 2000, Nucleic Acids Res..

[8]  Eric Jones,et al.  SciPy: Open Source Scientific Tools for Python , 2001 .

[9]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001 .

[10]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[11]  Itay Mayrose,et al.  Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues , 2002, ISMB.

[12]  Wes McKinney,et al.  pandas: a Foundational Python Library for Data Analysis and Statistics , 2011 .

[13]  James G. Lyons,et al.  Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning , 2015, Scientific Reports.

[14]  Brian C. Ross Mutual Information between Discrete and Continuous Data Sets , 2014, PloS one.

[15]  Xiaolong Wang,et al.  Using distances between Top-n-gram and residue pairs for protein remote homology detection , 2014, BMC Bioinformatics.

[16]  B. Efron © Institute of Mathematical Statistics, 2003 Second Thoughts on the Bootstrap , 2022 .

[17]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[18]  Fernando Nogueira,et al.  Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning , 2016, J. Mach. Learn. Res..

[19]  Xiaolong Wang,et al.  Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection , 2013, Bioinform..

[20]  O. Troyanskaya,et al.  Predicting effects of noncoding variants with deep learning–based sequence model , 2015, Nature Methods.

[21]  James G. Lyons,et al.  SPIDER2: A Package to Predict Secondary Structure, Accessible Surface Area, and Main-Chain Torsional Angles by Deep Neural Networks. , 2017, Methods in molecular biology.

[22]  David S. Wishart,et al.  Applications of Machine Learning in Cancer Prediction and Prognosis , 2006, Cancer informatics.

[23]  C. Zhang,et al.  Prediction of Membrane Protein Types Based on the Hydrophobic Index of Amino Acids , 2000, Journal of protein chemistry.

[24]  Xuan Liu,et al.  Identification of DNA-Binding Proteins by Combining Auto-Cross Covariance Transformation and Ensemble Learning , 2016, IEEE Transactions on NanoBioscience.

[25]  William Stafford Noble,et al.  Predicting the in vivo signature of human gene regulatory sequence , 2005, ISMB.

[26]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[27]  Bin Liu,et al.  Protein fold recognition based on sparse representation based classification , 2017, Artif. Intell. Medicine.

[28]  George Karypis,et al.  Profile-based direct kernels for remote homology detection and fold recognition , 2005, Bioinform..

[29]  B. Liu,et al.  Identification of Real MicroRNA Precursors with a Pseudo Structure Status Composition Approach , 2015, PloS one.

[30]  N. Ben-Tal,et al.  The ConSurf‐HSSP database: The mapping of evolutionary conservation among homologs onto PDB structures , 2004, Proteins.

[31]  Francis R. Bach,et al.  Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression , 2016, J. Mach. Learn. Res..

[32]  B. Frey,et al.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning , 2015, Nature Biotechnology.

[33]  K. Chou,et al.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. , 2014, Analytical biochemistry.

[34]  Xiaolong Wang,et al.  repRNA: a web server for generating various feature vectors of RNA sequences , 2015, Molecular Genetics and Genomics.

[35]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..

[36]  R. Sokal,et al.  Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. , 2006, American journal of physical anthropology.

[37]  Pengmian Feng,et al.  Prediction of DNase I Hypersensitive Sites by Using Pseudo Nucleotide Compositions , 2014, TheScientificWorldJournal.

[38]  Thierry Denoeux,et al.  A k-nearest neighbor classification rule based on Dempster-Shafer theory , 1995, IEEE Trans. Syst. Man Cybern..

[39]  B. Rost,et al.  Combining evolutionary information and neural networks to predict protein secondary structure , 1994, Proteins.

[40]  Gaël Varoquaux,et al.  The NumPy Array: A Structure for Efficient Numerical Computation , 2011, Computing in Science & Engineering.

[41]  Dong-Sheng Cao,et al.  propy: a tool to generate various modes of Chou's PseAAC , 2013, Bioinform..

[42]  K. Chou,et al.  Prediction of linear B-cell epitopes using amino acid pair antigenicity scale , 2007, Amino Acids.

[43]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[44]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[45]  Ren Long,et al.  iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework , 2016, Bioinform..

[46]  Yanzhi Guo,et al.  Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences , 2008, Nucleic acids research.

[47]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[48]  Wei Chen,et al.  PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions , 2015, Bioinform..

[49]  Kuo-Chen Chou,et al.  pSumo-CD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC , 2016, Bioinform..

[50]  Bo Jiang,et al.  Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes , 2014, PloS one.

[51]  Yue Gao,et al.  Improved and promising identification of human MicroRNAs by incorporating a high-quality negative set , 2014, TCBB.

[52]  Wei Chen,et al.  The organization of nucleosomes around splice sites , 2010, Nucleic acids research.

[53]  Kuo-Chen Chou,et al.  2L-piRNA: A Two-Layer Ensemble Classifier for Identifying Piwi-Interacting RNAs and Their Function , 2017, Molecular therapy. Nucleic acids.

[54]  Liaofu Luo,et al.  Splice site prediction with quadratic discriminant analysis using diversity measure. , 2003, Nucleic acids research.

[55]  Junjie Chen,et al.  ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank , 2017, Bioinform..

[56]  K. Chou,et al.  PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. , 2008, Analytical biochemistry.

[57]  Fan Yang,et al.  iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC , 2018, Bioinform..

[58]  Junjie Chen,et al.  Protein remote homology detection based on bidirectional long short-term memory , 2017, BMC Bioinformatics.

[59]  Minoru Kanehisa,et al.  AAindex: amino acid index database, progress report 2008 , 2007, Nucleic Acids Res..

[60]  K. Chou,et al.  iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. , 2013, Analytical biochemistry.

[61]  Bin Liu,et al.  Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences , 2017 .

[62]  William Stafford Noble,et al.  Predicting Human Nucleosome Occupancy from Primary Sequence , 2008, PLoS Comput. Biol..

[63]  Jia Liu,et al.  Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. , 2012, Journal of theoretical biology.

[64]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[65]  Junjie Chen,et al.  iMiRNA-SSF: Improving the Identification of MicroRNA Precursors by Combining Negative Sets with Different Distributions , 2016, Scientific Reports.

[66]  Xiaolong Wang,et al.  iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach , 2016, Journal of biomolecular structure & dynamics.

[67]  Xiaolong Wang,et al.  Protein Remote Homology Detection by Combining Chou’s Pseudo Amino Acid Composition and Profile‐Based Protein Representation , 2013, Molecular informatics.

[68]  Xiaolong Wang,et al.  Prediction of protein binding sites in protein structures using hidden Markov support vector machine , 2009, BMC Bioinformatics.

[69]  B. Liu,et al.  Pse-Analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods , 2017, Oncotarget.

[70]  G J Barton,et al.  Application of multiple sequence alignment profiles to improve protein secondary structure prediction , 2000, Proteins.

[71]  Ren Long,et al.  iRSpot-EL: identify recombination spots with an ensemble learning approach , 2017, Bioinform..

[72]  Christopher S. Oehmen,et al.  Physicochemical property distributions for accurate and rapid pairwise protein homology detection , 2010, BMC Bioinformatics.

[73]  Kuo-Chen Chou,et al.  Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers. , 2006, Journal of proteome research.

[74]  Sarah C. R. Elgin,et al.  The chromatin structure of specific genes: I. Evidence for higher order domains of defined DNA sequence , 1979, Cell.

[75]  Gang Tian,et al.  Accurate Prediction of Transposon-Derived piRNAs by Integrating Various Sequential and Physicochemical Features , 2016, PloS one.

[76]  Stephen V. Stehman,et al.  Selecting and interpreting measures of thematic classification accuracy , 1997 .

[77]  B. Liu,et al.  PseDNA‐Pro: DNA‐Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation , 2015, Molecular informatics.

[78]  Ivo L. Hofacker,et al.  Vienna RNA secondary structure server , 2003, Nucleic Acids Res..

[79]  B. Liu,et al.  Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection , 2012, PloS one.

[80]  Wei Chen,et al.  iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition , 2013, Nucleic acids research.

[81]  Xiaolong Wang,et al.  A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction , 2019, Briefings Bioinform..

[82]  B. Liu,et al.  DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation , 2015, Scientific Reports.

[83]  Wei Chen,et al.  iRNA-PseU: Identifying RNA pseudouridine sites , 2016, Molecular therapy. Nucleic acids.

[84]  Liaofu Luo,et al.  Prediction for human transcription start site using diversity measure with quadratic discriminant , 2008, Bioinformation.

[85]  B. Liu,et al.  Recombination spot identification Based on gapped k-mers , 2016, Scientific Reports.

[86]  Swetlana Nikolajewa,et al.  DiProDB: a database for dinucleotide properties , 2008, Nucleic Acids Res..

[87]  Fei Li,et al.  Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine , 2005, BMC Bioinformatics.

[88]  Kuo-Chen Chou,et al.  Ensemble classifier for protein fold pattern recognition , 2006, Bioinform..

[89]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[90]  Peng Jiang,et al.  MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features , 2007, Nucleic Acids Res..

[91]  B. Liu,et al.  iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance , 2016, Scientific Reports.

[92]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[93]  K. Chou,et al.  iRNA-Methyl: Identifying N(6)-methyladenosine sites using pseudo nucleotide composition. , 2015, Analytical biochemistry.

[94]  Lukasz A. Kurgan,et al.  A comprehensive comparative review of sequence-based predictors of DNA- and RNA-binding residues , 2016, Briefings Bioinform..

[95]  Wei Chen,et al.  iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition , 2014, Nucleic acids research.

[96]  D. Horne,et al.  Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities , 1988, Biopolymers.

[97]  Shuigeng Zhou,et al.  A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation , 2009, Bioinform..

[98]  Tin Kam Ho,et al.  The Random Subspace Method for Constructing Decision Forests , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[99]  Qin Lu,et al.  EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation , 2017, BMC Bioinformatics.

[100]  B. Liu,et al.  PSFM-DBT: Identifying DNA-Binding Proteins by Combing Position Specific Frequency Matrix and Distance-Bigram Transformation , 2017, International journal of molecular sciences.

[101]  B. Liu,et al.  iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition , 2014, PloS one.

[102]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[103]  Xiaolong Wang,et al.  Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation , 2015, BMC Systems Biology.

[104]  Michael A. Beer,et al.  Discriminative prediction of mammalian enhancers from DNA sequence. , 2011, Genome research.

[105]  Xin Wang,et al.  PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. , 2012, Analytical biochemistry.

[106]  Xiaolong Wang,et al.  Protein-protein interaction site prediction based on conditional random fields , 2007, Bioinform..

[107]  Jason Weston,et al.  Mismatch string kernels for discriminative protein classification , 2004, Bioinform..

[108]  Junjie Chen,et al.  A comprehensive review and comparison of different computational methods for protein remote homology detection , 2018, Briefings Bioinform..

[109]  Wei Chen,et al.  iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition , 2014, Bioinform..

[110]  Xiaolong Wang,et al.  repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects , 2015, Bioinform..

[111]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[112]  David R. Kelley,et al.  Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks , 2015, bioRxiv.