Correlating cellular features with gene expression using CCA

Understanding the biology of cancer requires joint analysis of multiple data modalities, including imaging and genomics. We propose canonical correlation analysis (CCA) and a sparse variant (Sparse CCA) as preliminary discovery tools for identifying connections across modalities, specifically between gene expression and features describing cell and nucleus shape, texture, and stain intensity in histopathological images. Applied to 615 breast cancer samples from The Cancer Genome Atlas, CCA revealed significant correlation of several image features with expression of PAM50 genes, which are known to be linked to clinical outcome, while Sparse CCA, without leveraging prior biological understanding, revealed associations with enrichment of pathways implicated in cancer. These findings affirm the utility of CCA for joint phenotype-genotype analysis of cancer.