NCIS: A NETWORK-ASSISTED CO-CLUSTERING ALGORITHM TO DISCOVER CANCER SUBTYPES BASED ON GENE EXPRESSION

Cancer subtype information is critically important for designing more effective treatments. In this thesis, we introduce a new co-clustering algorithm for cancer subtype identification that incorporates gene network information to simultaneously group samples and genes into biologically meaningful clusters. We call our method network-assisted co-clustering for the identification of cancer subtypes (NCIS). Prior to clustering, we assign weights to genes: those that play key roles in the network and/or show significant variation among samples are prioritized. Including these weights as an importance indicator in the clustering step allows us to rely more on genes that are informative and representative. The clustering step itself is a new weighted co-clustering method based on semi-nonnegative matrix tri-factorization. We evaluated the effectiveness of the algorithm on large-scale glioblastoma multiforme (GBM) and breast cancer (BRCA) datasets from TCGA and on simulated datasets. We found that NCIS achieves more reliable results with respect to clinical features than conventional semi-nonnegative matrix tri-factorization methods and consensus clustering. We also trained two classifiers for GBM and BRCA subtype identification based on NCIS's results. This new method should be useful for comprehensively detecting subtypes that are otherwise obscured by cancer heterogeneity, across various types of cancer, from high-throughput, high-dimensional gene expression data.
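The gene-weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the code below is an assumption rather than the NCIS implementation: it uses a PageRank-style propagation over the gene network, seeded with each gene's variance across samples, so that genes that are central in the network and/or highly variable receive larger weights. The function name `gene_weights`, the `damping` parameter, and the use of variance as the measure of variation are all hypothetical choices made for illustration.

```python
import numpy as np


def gene_weights(expr, adj, damping=0.85, n_iter=500, tol=1e-12):
    """Illustrative gene weights (an assumption, not the NCIS formula).

    expr: genes x samples expression matrix.
    adj:  symmetric genes x genes 0/1 adjacency matrix of the gene network.
    Returns one non-negative weight per gene; the weights sum to 1.
    """
    expr = np.asarray(expr, dtype=float)
    adj = np.asarray(adj, dtype=float)

    # Prior importance: variance of each gene across samples, scaled to sum to 1.
    var = expr.var(axis=1)
    prior = var / var.sum()

    # Column-normalise the adjacency matrix so each gene splits its score
    # evenly among its neighbours. Genes with no neighbours pass on nothing.
    deg = adj.sum(axis=0)
    trans = adj / np.where(deg > 0, deg, 1.0)

    # Power iteration: mix the variance prior with scores received from neighbours.
    w = prior.copy()
    for _ in range(n_iter):
        w_new = (1.0 - damping) * prior + damping * (trans @ w)
        converged = np.abs(w_new - w).sum() < tol
        w = w_new
        if converged:
            break

    # Renormalise in case isolated genes caused some score to be lost.
    return w / w.sum()
```

Calling `gene_weights(expr, adj)` on a genes-by-samples matrix and a matching adjacency matrix returns a weight vector that a weighted clustering objective could then use to scale each gene's contribution. With `damping=0` the weights reduce to normalised variance alone, and larger values give the network structure more influence.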
