Mining Massive Amounts of Genomic Data: A Semiparametric Topic Modeling Approach

ABSTRACT Characterizing the functional relevance of transcription factors (TFs) in different biological contexts is pivotal in systems biology. Given the massive amount of genomic data, computational identification of TFs is emerging as a useful approach to bridge functional genomics with disease risk loci. In this article, we use large-scale gene expression and chromatin immunoprecipitation (ChIP) data corpora to conduct high-throughput TF-biological context association analysis. This work makes two contributions: (i) From a methodological perspective, we propose a unified topic modeling framework for exploring and analyzing large and complex genomic datasets. Under this framework, we develop new statistical optimization algorithms and semiparametric theoretical analysis, which are also applicable to a variety of large-scale data analyses. (ii) From an experimental perspective, our method generates an informative list of tumor-related TFs and the tumor types they may affect. Our data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures of epigenetic regulators, such as SUZ12 and SET-DB1, and of nuclear receptors in many tumor types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types, many of which have not been reported before. In summary, our work establishes a robust method to identify associations between TFs and biological contexts. Given the limited number of genome-wide binding profiles of TFs and the massive number of expression profiles, our work provides a useful tool to deconvolute the gene regulatory network for tumors and other biological contexts. Supplementary materials for this article are available online.
