Identification of functionally related genes using data mining and data integration: a breast cancer case study

Background: Identifying the organisation and dynamics of molecular pathways is crucial for understanding cell function. To reconstruct the molecular pathways through which a gene of interest regulates a cell, it is important to identify the set of genes with which it interacts to determine cell function. In this context, mining and integrating the large amount of publicly available data on the transcriptome and proteome states of a cell is a useful complement to biological research.

Results: We describe an approach for the identification of genes that interact with each other to regulate cell function. The strategy relies on the analysis of gene expression profile similarity across large datasets of expression data. During the similarity evaluation, the methodology determines the most significant subset of samples in which the evaluated genes are highly correlated. The strategy therefore excludes samples that are not relevant to each gene pair analysed. This feature is important for a large set of samples characterised by heterogeneous experimental conditions, where different pools of biological processes can be active across the samples. The putative partners of the studied gene are then further characterised by analysing the distribution of their Gene Ontology terms and by integrating protein-protein interaction (PPI) data. The strategy was applied to the analysis of the functional relationships of a gene of known function, Pyruvate Kinase, and to the prediction of functional partners of the human transcription factor TBX3. In both cases the analysis was performed on a dataset composed of primary breast tumour expression data derived from the literature. Integration and analysis of PPI data confirmed the predictions of the methodology, since the genes identified as functionally related were associated with proteins that are close to each other in the PPI network. Two genes among the predicted putative partners of TBX3 (GLI3 and GATA3) were confirmed by in vivo binding assays (crosslinking immunoprecipitation, X-ChIP), in which the putative DNA enhancer sequence sites of GATA3 and GLI3 were found to be bound by the Tbx3 protein.

Conclusion: The presented strategy is shown to be an effective approach for identifying genes that establish functional relationships. The methodology identifies and characterises genes with similar expression profiles by mining and integrating data from publicly available resources, contributing to a better understanding of gene regulation and cell function. The prediction of the TBX3 target genes GLI3 and GATA3 was experimentally confirmed.
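
The idea of restricting a gene-pair comparison to the samples in which the two genes are highly correlated can be illustrated with a small sketch. The abstract does not specify how the subset is selected, so the code below uses a simple greedy backward elimination as a stand-in, not the authors' algorithm. The function names (`pearson`, `correlated_subset`), the threshold `r_min`, the lower bound `min_samples`, and the synthetic data are all assumptions made for illustration, and only positive correlation is targeted.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation of two 1-D arrays; 0.0 if either is constant."""
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc * xc).sum() * (yc * yc).sum())
    return 0.0 if denom == 0 else float((xc * yc).sum() / denom)


def correlated_subset(x, y, r_min=0.9, min_samples=10):
    """Greedily drop samples until the two profiles correlate at >= r_min.

    x, y: expression values of two genes over the same samples.
    Returns (indices of retained samples, correlation on those samples).
    Illustrative heuristic only: at each step the single sample whose
    removal yields the highest correlation is discarded.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = list(range(len(x)))
    r = pearson(x[keep], y[keep])
    while r < r_min and len(keep) > min_samples:
        candidates = []
        for i in keep:
            rest = [k for k in keep if k != i]
            candidates.append((pearson(x[rest], y[rest]), i))
        r, drop = max(candidates)  # best correlation after one removal
        keep.remove(drop)
    return keep, r


# Synthetic example: the two genes co-vary in the first 25 samples only.
rng = np.random.default_rng(0)
x = rng.normal(size=40)
y = x + rng.normal(scale=0.1, size=40)
y[25:] = rng.normal(size=15)  # unrelated behaviour in the remaining samples

keep, r = correlated_subset(x, y, r_min=0.9, min_samples=10)
print(f"retained {len(keep)} of {len(x)} samples, r = {r:.3f}")
```

The `min_samples` bound matters in a sketch like this: without it, almost any gene pair can be made to look correlated by discarding enough samples, so a real analysis would also need a significance criterion for the retained subset.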
