Ensemble Consensus-Guided Unsupervised Feature Selection to Identify Huntington’s Disease-Associated Genes

Because the pathological mechanisms of neurodegenerative diseases are complex, traditional methods for selecting differentially expressed genes cannot detect disease-associated genes accurately. Recent studies have shown that consensus-guided unsupervised feature selection (CGUFS) performs well in identifying disease-associated genes. However, the random initialization of the feature selection matrix in CGUFS makes the final disease-associated gene set unstable. In this study, we propose an ensemble method based on CGUFS, namely ensemble consensus-guided unsupervised feature selection (ECGUFS), to further improve the accuracy of disease-associated gene identification and the stability of the feature gene set. We also propose a bagging integration strategy to combine the results of CGUFS. We conducted experiments on Huntington’s disease RNA sequencing (RNA-Seq) data and obtained the final feature gene set, in which we detected 287 disease-associated genes. Enrichment analysis of these genes showed that the postsynaptic density, postsynaptic membrane, synapse, and cell junction are all affected during disease progression. Overall, ECGUFS greatly improved the accuracy of disease-associated gene prediction and the stability of the disease-associated gene set. We also classified the labeled samples using a linear support vector machine with 10-fold cross-validation; the average accuracy was 0.9, which supports the effectiveness of the feature gene set.
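
The bagging-style integration described above can be illustrated with a minimal sketch. It is not the ECGUFS implementation: the abstract does not specify how the CGUFS results are combined, so the selection-frequency vote used here is an assumption, and `base_selector` is a hypothetical stand-in (a simple variance ranking) for a single CGUFS run. The names `ensemble_select`, `n_runs`, and `min_freq` are likewise illustrative.

```python
import random
from collections import Counter
from statistics import pvariance


def base_selector(samples, k):
    """Hypothetical stand-in for one CGUFS run: keep the k highest-variance genes."""
    n_genes = len(samples[0])
    variances = [pvariance([row[g] for row in samples]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return set(ranked[:k])


def ensemble_select(samples, k, n_runs=50, min_freq=0.5, seed=0):
    """Aggregate base-selector results over bootstrap resamples (assumed voting rule)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        # Bootstrap: draw as many samples as the original set, with replacement.
        boot = [rng.choice(samples) for _ in samples]
        counts.update(base_selector(boot, k))
    # Keep genes that were selected in at least min_freq of the runs.
    return sorted(g for g, c in counts.items() if c / n_runs >= min_freq)


# Toy expression matrix: 20 samples x 6 genes; genes 0 and 1 vary strongly.
toy_rng = random.Random(1)
data = [
    [toy_rng.gauss(0, 5), toy_rng.gauss(0, 5)]
    + [toy_rng.gauss(0, 0.1) for _ in range(4)]
    for _ in range(20)
]
selected = ensemble_select(data, k=2)
print(selected)
```

Aggregating over many resampled runs is what gives an ensemble selector its stability: a gene enters the final set only if it is chosen repeatedly, so a single unlucky initialization or resample has little influence on the result.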
