Paper information - BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches

With the avalanche of biological sequences generated in the post-genomic age, one of the most challenging problems is how to computationally analyze their structures and functions. Machine learning techniques play key roles in this field. Typically, a predictor based on machine learning techniques is built in three main steps: feature extraction, predictor construction and performance evaluation. Although several Web servers and stand-alone tools have been developed to facilitate biological sequence analysis, each focuses on only an individual step. In this study, a Web server called BioSeq-Analysis (http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/) is proposed to automatically complete all three steps for constructing a predictor. The user only needs to upload the benchmark data set; BioSeq-Analysis then generates the optimized predictor based on that data set and reports the performance measures. For users' convenience, a stand-alone program was also released, which can be downloaded from http://bioinformatics.hitsz.edu.cn/BioSeq-Analysis/download/ and run directly on Windows, Linux and UNIX. Applied to three sequence analysis tasks, the predictors generated by BioSeq-Analysis outperformed some state-of-the-art methods in the reported experiments. It is anticipated that BioSeq-Analysis will become a useful tool for biological sequence analysis.
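The three steps named in the abstract (feature extraction, predictor construction, performance evaluation) can be illustrated with a minimal sketch in plain Python. This is not BioSeq-Analysis code and does not reproduce its feature sets or classifiers: the k-mer frequency features, the nearest-centroid classifier, the leave-one-out evaluation, the function names and the toy sequences are all assumptions chosen only to keep the example small and self-contained.

```python
from collections import Counter
from itertools import product
from math import dist


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Step 1 (feature extraction): normalized k-mer frequencies."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)  # avoid division by zero
    return [counts[m] / total for m in kmers]


def train_centroids(X, y):
    """Step 2 (predictor construction): one mean feature vector per class.

    Nearest-centroid is a stand-in chosen for brevity, not the
    classifier used by BioSeq-Analysis.
    """
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


def leave_one_out_accuracy(X, y):
    """Step 3 (performance evaluation): leave-one-out accuracy."""
    correct = 0
    for i in range(len(X)):
        model = train_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += int(predict(model, X[i]) == y[i])
    return correct / len(X)


# Hypothetical toy benchmark: AT-rich (label 1) vs GC-rich (label 0) sequences.
sequences = ["ATATATATAT", "TATATTATAA", "AATATATTAT",
             "GCGCGCGGCC", "CGCGGCGCGC", "GGCCGCGCCG"]
labels = [1, 1, 1, 0, 0, 0]

X = [kmer_features(s) for s in sequences]
accuracy = leave_one_out_accuracy(X, labels)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

The toy classes share no dinucleotides, so the sketch separates them trivially; a real benchmark data set would call for richer features and a stronger classifier, with parameters chosen by cross-validation.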

Bin Liu
