A topic model analysis of TCGA transcriptomic data

Topic modelling is a widely used technique for extracting relevant information from large arrays of data. The problem of finding a topic structure in a dataset was recently recognized to be analogous to the community detection problem in network theory. Leveraging this analogy, a new class of topic modelling strategies has been introduced to overcome some of the limitations of classical methods. This paper applies these recent ideas to TCGA transcriptomic data on breast and lung cancer. The inferred latent topic structure reconstructs the established cancer subtype organization well. Moreover, we identify specific topics that are enriched in genes known to play a role in the corresponding disease and that are strongly related to patient survival probability. Finally, we show that a simple neural network classifier operating in the low-dimensional topic space predicts the cancer subtype of a test expression sample with high accuracy.
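
The analogy between topic modelling and community detection can be made concrete with a small sketch. It assumes the bipartite representation commonly used in network approaches to topic models, where samples play the role of documents, genes the role of words, and expression counts the role of edge weights. The helper names and the toy count table below are hypothetical and serve only as illustration; they are not taken from the paper's pipeline.

```python
from collections import defaultdict


def expression_to_bipartite_edges(counts, sample_ids, gene_ids):
    """Turn a samples x genes count table into weighted bipartite edges.

    Hypothetical helper: every nonzero entry becomes an edge
    (sample, gene, weight), mirroring the document-word network
    built from a text corpus.
    """
    edges = []
    for sample, row in zip(sample_ids, counts):
        for gene, count in zip(gene_ids, row):
            if count > 0:
                edges.append((sample, gene, count))
    return edges


def node_strengths(edges):
    """Sum the edge weights attached to each sample node and gene node."""
    strengths = defaultdict(int)
    for sample, gene, weight in edges:
        strengths[sample] += weight
        strengths[gene] += weight
    return dict(strengths)


# Toy data invented for this example (not real TCGA values).
sample_ids = ["s1", "s2", "s3"]
gene_ids = ["geneA", "geneB", "geneC"]
counts = [
    [5, 0, 2],
    [0, 3, 1],
    [4, 0, 0],
]

edges = expression_to_bipartite_edges(counts, sample_ids, gene_ids)
strengths = node_strengths(edges)
```

A community detection method would then be run on this network, grouping genes into topics and samples into clusters; that inference step is not shown here.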
