Knowledge-guided multi-scale independent component analysis for biomarker identification

Background: Many statistical methods have been proposed to identify disease biomarkers from gene expression profiles. However, statistical methods applied to gene expression profile data alone often fail to identify biologically meaningful biomarkers related to the specific disease under study. In this paper, we develop a novel strategy, knowledge-guided multi-scale independent component analysis (ICA), that first infers regulatory signals and then identifies biologically relevant biomarkers from microarray data.

Results: Since gene expression levels reflect the joint effect of several underlying biological functions, disease-specific biomarkers may be involved in several distinct biological functions. To identify disease-specific biomarkers that provide unique mechanistic insights, a meta-data "knowledge gene pool" (KGP) is first constructed from multiple data sources to provide information on the likely functions (such as gene ontology information) and regulatory events (such as promoter responsive elements) associated with potential genes of interest. The gene expression and biological meta-data associated with the members of the KGP are then used to guide subsequent analysis. ICA is applied to multi-scale gene clusters to reveal regulatory modes reflecting the underlying biological mechanisms. Finally, disease-specific biomarkers are extracted according to their weighted connectivity scores associated with the extracted regulatory modes. A statistical significance test based on motif information is used to evaluate transcription factor enrichment in the extracted gene set. We applied the proposed method to yeast cell cycle microarray data and Rsf-1-induced ovarian cancer microarray data. The results show that our knowledge-guided ICA approach can extract biologically meaningful regulatory modes and outperforms several baseline methods for biomarker identification.

Conclusion: We have proposed a novel method, knowledge-guided multi-scale ICA, to identify disease-specific biomarkers. The goal is to infer knowledge-relevant regulatory signals and then identify the corresponding biomarkers through a multi-scale strategy. The approach has been applied to two expression profiling experiments, where it showed improved performance in extracting biologically meaningful and disease-related biomarkers. More importantly, the proposed approach shows promise for inferring novel biomarkers for ovarian cancer and extending current knowledge.
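The abstract does not state which significance test is used for transcription factor enrichment. As an illustration only, the following minimal sketch assumes a one-sided hypergeometric test, a common choice for this kind of question: given a gene population in which some genes carry a binding motif for a transcription factor, it computes the probability of seeing at least the observed number of motif-bearing genes in an extracted gene set by chance. The function name, its parameters, and the toy numbers are hypothetical and do not come from the paper.

```python
from math import comb


def enrichment_p_value(population: int, motif_in_population: int,
                       selected: int, motif_in_selected: int) -> float:
    """Upper-tail hypergeometric probability P(X >= motif_in_selected).

    population          -- total number of genes considered (assumed background)
    motif_in_population -- genes in the background carrying the motif
    selected            -- size of the extracted gene set
    motif_in_selected   -- motif-bearing genes observed in the extracted set
    """
    # The number of motif-bearing genes in the set cannot exceed either bound.
    upper = min(motif_in_population, selected)
    # Number of ways to draw the gene set from the background.
    total = comb(population, selected)
    # Count draws containing k motif genes for every k at or above the
    # observed value; math.comb returns 0 for impossible configurations.
    tail = sum(
        comb(motif_in_population, k)
        * comb(population - motif_in_population, selected - k)
        for k in range(motif_in_selected, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers (hypothetical): 20 genes, 5 with the motif, 4 extracted,
    # 2 of the extracted genes carry the motif.
    print(enrichment_p_value(20, 5, 4, 2))
```

A small p-value would suggest that the motif is over-represented in the extracted gene set. In practice a correction for testing many transcription factors would also be needed, which this sketch omits.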
