CrossICC: iterative consensus clustering of cross-platform gene expression data without adjusting batch effect

Unsupervised clustering of high-throughput gene expression data is widely used for cancer subtyping. However, cancer subtypes derived from a single dataset usually do not transfer to datasets generated on other platforms. Merging datasets is necessary to determine accurate and broadly applicable cancer subtypes, but it remains challenging because of batch effects. CrossICC is an R package for the unsupervised clustering of gene expression data from multiple datasets and platforms that does not require batch effect adjustment. CrossICC uses an iterative strategy to derive the optimal gene signature and number of clusters from a consensus similarity matrix generated by consensus clustering. The package also provides functions to visualize the identified subtypes and to evaluate subtyping performance. We expect that CrossICC can be used to discover robust cancer subtypes with significant translational implications for the personalized care of cancer patients.
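
To make the notion of a consensus similarity matrix concrete, the following is a minimal Python sketch of generic resampling-based consensus clustering. It is not the CrossICC implementation and does not use the CrossICC API: the function names (`kmeans`, `consensus_matrix`), the choice of k-means as the base clusterer, the 80% subsampling fraction, and the toy data are all assumptions made for illustration. Each entry of the resulting matrix is the fraction of resampling runs in which two samples landed in the same cluster, counted only over runs where both were drawn.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means (illustrative base clusterer); returns one label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of subsampling runs in which samples i and j share a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # runs where i and j co-clustered
    sampled = [[0] * n for _ in range(n)]   # runs where i and j were both drawn
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): samples 0-4 form one group, samples 5-9 another.
samples = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.3], [5.3, 5.0], [5.1, 5.4],
]
M = consensus_matrix(samples, k=2)
print(round(M[0][1], 2), round(M[0][5], 2))
```

Entries near 1 indicate pairs that are consistently grouped together and entries near 0 indicate pairs that are consistently separated. In a full workflow, a matrix of this kind would be computed for several candidate values of `k` and used to choose the number of clusters.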
