Dual Hypergraph Regularized PCA for Biclustering of Tumor Gene Expression Data

Clustering is a powerful approach to analyzing gene expression data, which is crucial to the search for effective cancer treatments. Many graph-regularized clustering methods have been proposed and shown to outperform traditional clustering methods. However, they focus only on the inner structure of the samples and fail to take the feature manifold into account. In gene expression data, it is reasonable to hypothesize that both the samples and the genes lie on nonlinear low-dimensional manifolds, namely the sample manifold and the gene manifold, respectively. Therefore, in this paper we propose a Dual Hypergraph Regularized PCA (DHPCA) method for biclustering of tumor data that incorporates the geometric structures of both samples and features. First, we construct two hypergraphs from the gene expression data, a sample hypergraph and a gene hypergraph, to estimate the intrinsic geometric structures of the samples and the genes. Then, we introduce hypergraph regularization on both the gene side and the sample side. Finally, we formulate our biclustering method as two hypergraph regularized PCA problems with closed-form solutions. We experimentally validate the proposed DHPCA algorithm on real applications, and the promising results indicate its potential for high-dimensional data analysis.
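The three steps described above can be illustrated with a minimal sketch. It is not the DHPCA implementation: the abstract does not state the objective, so the sketch assumes k-nearest-neighbor hyperedges with unit weights, the normalized hypergraph Laplacian L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}, and a graph-Laplacian-PCA-style objective min ||X - U Q^T||_F^2 + alpha tr(Q^T L Q) subject to Q^T Q = I, whose closed-form solution takes Q from the eigenvectors of alpha L - X^T X with the smallest eigenvalues. The function names, the synthetic data, and the values of k and alpha are hypothetical.

```python
import numpy as np

def knn_hypergraph_laplacian(points, k=3):
    """Normalized hypergraph Laplacian built from kNN hyperedges (assumed construction).

    points: (n, d) array, one row per vertex. Each vertex spawns one hyperedge
    containing itself and its k nearest neighbours.
    """
    n = points.shape[0]
    sq = np.sum(points ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dist, -np.inf)          # guarantee each vertex ranks itself first
    H = np.zeros((n, n))                     # incidence: H[v, e] = 1 if v is in hyperedge e
    for e in range(n):
        members = np.argsort(dist[e])[: k + 1]
        H[members, e] = 1.0
    w = np.ones(n)                           # unit hyperedge weights (assumption)
    dv = H @ w                               # vertex degrees
    de = H.sum(axis=0)                       # hyperedge degrees
    dv_inv_sqrt = 1.0 / np.sqrt(dv)
    theta = (dv_inv_sqrt[:, None] * H) @ np.diag(w / de) @ (H.T * dv_inv_sqrt[None, :])
    return np.eye(n) - theta

def regularized_pca(X, L, n_components, alpha):
    """Closed form of min ||X - U Q^T||_F^2 + alpha * tr(Q^T L Q), s.t. Q^T Q = I.

    X: (p, n) matrix whose n columns are the items being embedded; L: (n, n).
    """
    G = alpha * L - X.T @ X
    G = (G + G.T) / 2.0                      # symmetrize against round-off
    _, eigvecs = np.linalg.eigh(G)           # eigenvalues in ascending order
    Q = eigvecs[:, :n_components]            # eigenvectors of the smallest eigenvalues
    U = X @ Q                                # optimal U for this Q
    return U, Q

# Synthetic stand-in for an expression matrix: 20 genes x 8 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))

L_s = knn_hypergraph_laplacian(X.T, k=3)     # sample hypergraph (vertices = samples)
L_g = knn_hypergraph_laplacian(X, k=3)       # gene hypergraph (vertices = genes)

_, Q_s = regularized_pca(X, L_s, n_components=2, alpha=1.0)    # sample embedding
_, Q_g = regularized_pca(X.T, L_g, n_components=2, alpha=1.0)  # gene embedding
```

In this sketch the rows of `Q_s` and `Q_g` are low-dimensional representations of the samples and the genes; one plausible way to obtain biclusters is to cluster each set of rows separately, for example with k-means, although the abstract does not say how the two sides are combined.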
