Biomarker Identification by Knowledge-Driven Multi-Level ICA and Motif Analysis

Statistical methods often fail to identify biologically meaningful biomarkers for a disease under study when they rely on expression data alone. In this paper, we develop a novel strategy, knowledge-driven multi-level independent component analysis (ICA), to infer regulatory signals and identify biologically relevant biomarkers from microarray data. Specifically, based on multi-level clustering results and partial prior knowledge, we apply ICA to find stable, disease-specific linear regulatory modes and then extract the associated biomarker genes. A statistical test is designed to evaluate, based on motif information, the significance of transcription factor enrichment in the extracted gene set. Experimental results on an Rsf-1-induced microarray data set show that, compared with other gene selection methods with or without prior knowledge, our knowledge-driven method extracts more biologically meaningful biomarkers, with significant enrichment of transcription factors related to ovarian cancer.
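The abstract does not specify the form of the enrichment test. As an illustration only, the sketch below uses a one-sided hypergeometric test, a common choice for this kind of question and an assumption here, not necessarily the test designed in the paper. It asks how likely it is that a selected gene set contains at least the observed number of genes carrying a binding motif for a given transcription factor, if the genes were drawn at random from the background. The function name and all parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_background_with_motif,
                       n_selected, n_selected_with_motif):
    """Upper-tail hypergeometric probability P(X >= n_selected_with_motif).

    Assumed setup (not taken from the paper): n_selected genes are drawn
    without replacement from n_background genes, of which
    n_background_with_motif carry the transcription factor's motif.
    """
    total = comb(n_background, n_selected)
    # The overlap cannot exceed either the selection or the motif set.
    upper = min(n_selected, n_background_with_motif)
    tail = sum(
        comb(n_background_with_motif, k)
        * comb(n_background - n_background_with_motif, n_selected - k)
        for k in range(n_selected_with_motif, upper + 1)
    )
    return tail / total


# Toy numbers for illustration only, not data from the study.
p = enrichment_p_value(n_background=10, n_background_with_motif=5,
                       n_selected=5, n_selected_with_motif=5)
print(p)
```

With these toy numbers only one of the 252 possible selections consists entirely of motif-carrying genes, so the p-value is 1/252. In practice the background would be all genes on the array, and testing many transcription factors would call for a multiple-testing correction.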
