Self-supervised learning for characterising histomorphological diversity and spatial RNA expression prediction across 23 human tissue types

As vast histological archives are digitised, there is a pressing need to associate specific tissue substructures and incident pathology with disease outcomes without arduous annotation. Such automation provides an opportunity to learn fundamental biology about how tissue structure and function vary in a population. Recently, self-supervised learning has proven competitive with supervised machine learning approaches in classification, segmentation and representation learning. Here, we leverage self-supervised learning to generate histology feature representations using 1.7M images across 23 healthy tissues in 838 donors from GTEx. Using these representations, we demonstrate that we can automatically segment tissues into their constituent tissue substructures and pathology proportions, and surpass the performance of conventionally used pre-trained models. We observe striking population variability in canonical tissue substructures, and highlight examples of missing pathological diagnoses, incorrect assignment of target tissue and cross-tissue contamination. We demonstrate that this variability in tissue composition leads to a likely overestimation of eQTL tissue sharing and drives dramatic differential gene expression changes. We use the derived tissue substructures to detect 284 tissue substructure- and pathology-specific eQTLs. As our derived histology representations are rich morphological descriptors of the underlying tissue, we introduce a multiple instance learning model that can predict individual RNA expression levels directly from histology and spatially localise them to specific substructures and pathological features. We validate our spatial RNA predictions with matched ground-truth immunohistochemistry (IHC) for several well-characterised marker genes, recapitulating their known spatial specificity. Finally, we derive a gene expression spatial enrichment metric, allowing us to detect genes specifically expressed within sites of pathology (e.g. arterial calcification). Together, these results demonstrate the power of self-supervised machine learning when applied to vast histological datasets, allowing researchers to pose and answer questions about tissue pathology, its spatial organisation and the interplay between morphological tissue variability and gene expression.
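To make the multiple instance learning idea concrete, the sketch below treats one slide as a bag of tile embeddings and predicts a single expression value from it. The abstract does not specify the architecture, so attention pooling with a linear head is an assumption made here for illustration, and every name and parameter value (`attention_mil_predict`, `V`, `w`, `u`, `b`, the toy embeddings) is hypothetical. Because the head is linear in the pooled embedding, the prediction decomposes exactly into per-tile contributions, which is one simple way a slide-level prediction could be localised to tiles.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil_predict(tiles, V, w, u, b):
    """Attention-pooled MIL forward pass (illustrative assumption, not the paper's model).

    tiles: list of tile embedding vectors; one slide is one bag.
    V, w:  attention parameters (hypothetical); score_i = w . tanh(V h_i).
    u, b:  linear regression head (hypothetical) applied to the pooled embedding.
    Returns (prediction, attention weights, per-tile contributions).
    """
    scores = [dot(w, [math.tanh(dot(row, h)) for row in V]) for h in tiles]
    attn = softmax(scores)
    # The head is linear, so prediction = b + sum_i attn_i * (u . h_i).
    contrib = [a * dot(u, h) for a, h in zip(attn, tiles)]
    pred = b + sum(contrib)
    return pred, attn, contrib


# Toy bag: three tiles with 2-dimensional embeddings (made-up numbers).
tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [2.0, -1.0]
u = [0.5, 0.25]
b = 0.1

pred, attn, contrib = attention_mil_predict(tiles, V, w, u, b)
```

In this toy bag the first tile receives the largest attention weight, so mapping `attn` or `contrib` back to tile coordinates would highlight that region of the slide. In practice the parameters would be learned from paired slide and bulk expression data rather than set by hand.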
