Identifying subset of genes that have influential impacts on cancer progression: a new approach to analyze cancer microarray data

Cancer is a complex genetic disease that results from defects in multiple genes. The development of microarray techniques makes it possible to survey the whole genome and detect genes that influence cancer progression. Statistical analysis of cancer microarray data is challenging because of the high dimensionality and clustered nature of gene expressions. Here, clusters are composed of genes with coordinated pathological functions and/or correlated expressions. In this article, we consider cancer studies in which a censored survival endpoint is measured along with microarray gene expressions. We propose a hybrid clustering approach, which uses both pathological pathway information retrieved from KEGG and statistical correlations of gene expressions, to construct gene clusters. Cancer survival time is modeled as a linear function of gene expressions. We adopt the clustering threshold gradient directed regularization (CTGDR) method for simultaneous gene cluster selection, within-cluster gene selection, and predictive model building. Analysis of two lymphoma studies shows that the proposed approach (hybrid gene clustering, a linear regression model for survival, and cluster-regularized estimation with CTGDR) can effectively identify gene clusters, and genes within the selected clusters, that have satisfactory predictive power for censored cancer survival outcomes.
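The idea of selecting gene clusters and genes within clusters at the same time can be illustrated with a small two-level threshold gradient sketch. This is a simplified illustration under stated assumptions, not the CTGDR algorithm as published: it uses a plain least-squares loss and ignores censoring, it takes cluster labels as given (rather than deriving them from KEGG pathways and expression correlations), and the function name `clustered_tgdr` and the parameters `tau_cluster`, `tau_gene`, `step`, and `n_iter` are hypothetical.

```python
import numpy as np


def clustered_tgdr(X, y, clusters, tau_cluster=0.9, tau_gene=0.9,
                   step=0.01, n_iter=200):
    """Two-level threshold gradient sketch for a least-squares loss.

    X        : (n, p) gene expression matrix
    y        : (n,) response, e.g. log survival time (censoring ignored here)
    clusters : length-p array of cluster labels, assumed to be given
    Returns the length-p coefficient vector; unselected genes stay at zero.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    n, p = X.shape
    beta = np.zeros(p)
    labels = np.unique(clusters)

    for _ in range(n_iter):
        # Negative gradient of 0.5 * mean squared error with respect to beta.
        grad = X.T @ (y - X @ beta) / n
        if np.max(np.abs(grad)) == 0.0:
            break

        # Cluster level: keep clusters whose mean absolute gradient is
        # close to that of the strongest cluster.
        strength = np.array(
            [np.abs(grad[clusters == c]).mean() for c in labels]
        )
        keep_cluster = strength >= tau_cluster * strength.max()

        # Gene level: inside each kept cluster, keep genes whose absolute
        # gradient is close to the largest one in that cluster.
        mask = np.zeros(p, dtype=bool)
        for c, keep in zip(labels, keep_cluster):
            if not keep:
                continue
            idx = np.flatnonzero(clusters == c)
            g = np.abs(grad[idx])
            mask[idx[g >= tau_gene * g.max()]] = True

        # Update only the selected coefficients.
        beta += step * mask * grad

    return beta
```

In this sketch, thresholds near 1 update only the strongest cluster and gene at each step, which tends to give sparse fits, while thresholds near 0 update almost every coefficient. In practice the thresholds and the number of iterations would be tuning parameters, for example chosen by cross-validation.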
