Predicting protein subcellular location with network embedding and enrichment features.

The subcellular location of a protein is closely related to its function, so identifying where a given protein resides is an essential step in studying it. Traditional experimental methods give reliable determinations, but their high cost and low efficiency are evident limitations. Computational methods provide an alternative. Most previous methods extract features from protein sequences or structures to build prediction models. In this study, we combine two types of features to construct the model. The first type is extracted from a protein-protein interaction network via network embedding and captures the relationship between the encoded protein and other proteins. The second type is obtained from gene ontology and biological pathway enrichment and indicates the known functions of the encoded protein. These features are analyzed with several feature selection methods, and the resulting optimal features are used to build the model with a recurrent neural network as the classification algorithm. This model performs well, with a Matthews correlation coefficient of 0.844. A decision tree is also used as a rule learning classifier to extract decision rules. Although the predictive performance of the decision rules is poor, they are valuable in revealing the molecular mechanisms of proteins with different subcellular locations, and the final analysis confirms the reliability of the extracted rules. The source code of the proposed method is freely available at https://github.com/xypan1232/rnnloc.
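
The Matthews correlation coefficient reported above can be illustrated with a minimal sketch. The abstract does not state which variant of the coefficient was used; the code below assumes the common multi-class generalization computed from per-class counts, which reduces to the standard binary formula for two classes. The function name `multiclass_mcc` and the example labels are hypothetical and are not taken from the authors' repository.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient (illustrative sketch).

    Assumption: this is the count-based multi-class generalization,
    not necessarily the exact variant used in the study.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_counts = Counter(y_true)                        # true occurrences per class
    p_counts = Counter(y_pred)                        # predictions per class
    labels = set(t_counts) | set(p_counts)

    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())

    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the coefficient is undefined
    # (for example, when every sample is assigned to one class).
    return cov_tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical location labels, for illustration only.
    truth = ["nucleus", "cytoplasm", "nucleus", "membrane"]
    guess = ["nucleus", "cytoplasm", "membrane", "membrane"]
    print(round(multiclass_mcc(truth, guess), 3))
```

A value of 1 indicates perfect agreement between predicted and true locations, and a value near 0 indicates predictions no better than chance, which is why the coefficient is often preferred over plain accuracy when location classes are imbalanced.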
