k-Skip-n-Gram-RF: A Random Forest Based Method for Alzheimer's Disease Protein Identification

In this paper, we propose a machine learning based computational method for identifying Alzheimer's disease genes. Most existing machine learning based methods rely on structural magnetic resonance imaging (MRI). These methods attain acceptable results, but they are expensive and time consuming. We therefore identify Alzheimer's disease genes from the sequence information of proteins alone: the protein sequence information is extracted as adaptive k-skip-n-gram features, and the resulting feature vectors are classified by a random forest. Experimental results show that the proposed method attains an accuracy of 85.5% on the selected UniProt dataset.
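As a rough illustration of the feature extraction step, the sketch below computes k-skip-2-gram frequencies for a protein sequence. It assumes a common formulation (n fixed at 2, residue pairs separated by 0 to k skipped positions, counts normalized by the total number of pairs); the adaptive variant used in the paper is not specified in this abstract and may differ. The function name `k_skip_bigram_features` and the default `k=2` are hypothetical choices made for this example.

```python
from itertools import product

# The 20 standard amino acids; pairs containing any other symbol are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_skip_bigram_features(sequence, k=2):
    """Return a 400-dimensional vector of normalized residue-pair frequencies.

    A pair (a, b) is counted whenever b follows a with 0..k residues
    skipped in between. This is an assumed, simplified formulation,
    not necessarily the adaptive scheme of the paper.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    index = {pair: i for i, pair in enumerate(pairs)}
    counts = [0] * len(pairs)
    total = 0

    for skip in range(k + 1):
        step = skip + 1  # distance between the two residues of a pair
        for i in range(len(sequence) - step):
            j = index.get(sequence[i] + sequence[i + step])
            if j is not None:
                counts[j] += 1
                total += 1

    if total == 0:
        return [0.0] * len(pairs)
    return [c / total for c in counts]


# Toy sequence for illustration only (not taken from the UniProt dataset).
features = k_skip_bigram_features("ACA", k=1)
```

Each sequence maps to a fixed-length vector regardless of its length, which is what allows a random forest (or any other standard classifier) to be trained on the result.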
