Model validation for gene selection and regulation maps

Consider the problem of investigating the structure of a set of sample points in a very high-dimensional (Euclidean) space. This setting is paradigmatic in, for instance, postgenomic applications. High dimensionality combined with small sample size makes statistical inference and optimization difficult, and no consensus guidelines currently exist for selecting a model or choosing a learning algorithm. Usually, a linear or nonlinear projection method is required to map the observations into a low-dimensional space that preserves the most salient data features. This step typically involves computing statistics in the low-dimensional projected feature space and then drawing inferences about the original high-dimensional structures (the genes). This work deals with model validation for gene selection and regulation dynamics. The analysis combines quantitative methods with qualitative considerations. A regularized inference approach is employed, based on dimensionality reduction, data denoising, and feature extraction. Each of these tasks requires implementing statistical and machine learning algorithms. We focus on the complex problem of inferring coregulation from coexpression gene dynamics, given limited biological information and time course perturbation experiments. In particular, both separation and interference gene dynamics are considered and validated in order to design the most coherent underlying transcriptional regulatory map.
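As an illustration of the linear projection step described above, the following is a minimal sketch, not the method used in this work. It assumes a centered singular value decomposition as the projection, NumPy as the numerical library, and a synthetic samples-by-genes matrix; the function name `project_svd` and all data are hypothetical.

```python
import numpy as np


def project_svd(X, k):
    """Project the rows of X (samples x genes) onto the top-k singular directions.

    Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)                      # center each gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # sample coordinates in the k-dim space
    loadings = Vt[:k]                            # weight of each gene on each direction
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # fraction of variance retained
    return scores, loadings, explained


# Synthetic data: few samples, many genes (the small-sample, high-dimension regime).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2000))

scores, loadings, explained = project_svd(X, k=3)

# Statistics are computed in the projected space; inference about genes goes
# back through the loadings, e.g. ranking genes by absolute weight.
top_genes = np.argsort(-np.abs(loadings[0]))[:10]
```

Here `scores` holds the low-dimensional representation of the samples, while `loadings` links each projected direction back to the original genes, which is where gene selection would take place under these assumptions.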
