A new symbolic representation for the identification of informative genes in replicated microarray experiments.

Microarray experiments generate massive amounts of data, necessitating innovative algorithms to distinguish biologically relevant information from noise. Because the variability of gene expression data is an important factor in determining which genes are differentially expressed, analysis techniques that take repeated measurements into account are critically important. In addition, informative genes are typically selected by searching for the individual genes that vary the most across conditions. Yet because genes tend to act in groups rather than individually, more information may be gleaned from the data by searching specifically for concerted behavior in a set of genes. Applying a symbolic transformation to the gene expression data allows the detection of overrepresented patterns in the data, in contrast to looking only for genes that exhibit maximal differential expression. These challenges are addressed by introducing an algorithm based on a new symbolic representation that searches for concerted gene expression patterns; furthermore, the symbolic representation takes into account the variance across multiple replicates and can be applied to long time series data. The proposed algorithm's ability to discover biologically relevant signals in gene expression data is demonstrated by applying it to three datasets that measure gene expression in the rat liver.
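To illustrate the general idea of a symbolic transformation followed by a search for overrepresented patterns, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a SAX-style discretization (z-normalization followed by equiprobable Gaussian breakpoints) and collapses replicates by simple averaging, which is only a stand-in for the variance-aware representation described above. The function names `symbolize` and `count_patterns`, the three-letter alphabet, and the input layout are all hypothetical.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev


def symbolize(replicates, alphabet="abc"):
    """Convert one gene's replicated time series into a symbol string.

    replicates: list of equal-length lists, one per replicate (assumed layout).
    Assumption: replicates are averaged per time point; the paper's own
    handling of replicate variance is not reproduced here.
    """
    # Mean expression at each time point across replicates.
    profile = [mean(values) for values in zip(*replicates)]
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        # A flat profile maps to the middle symbol at every time point.
        return alphabet[len(alphabet) // 2] * len(profile)
    z = [(x - mu) / sigma for x in profile]
    # Breakpoints that split a standard normal into equiprobable bins.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    # Each value's symbol index is the number of breakpoints it exceeds.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in z)


def count_patterns(genes, alphabet="abc"):
    """Count how many genes share each symbolic pattern.

    genes: dict mapping a gene name to its list of replicate series.
    """
    return Counter(symbolize(reps, alphabet) for reps in genes.values())
```

Patterns shared by many more genes than expected by chance would then be candidates for concerted behavior; assessing that significance is outside the scope of this sketch.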
