A novel data mining method to identify assay-specific signatures in functional genomic studies

Background: The high-dimensional data produced by functional genomic (FG) studies make it difficult to visualize relationships between gene products and experimental conditions (i.e., assays). Although dimensionality reduction methods such as principal component analysis (PCA) have been very useful, their application to identifying assay-specific signatures has been limited by the lack of appropriate methodologies. This article proposes a new and powerful PCA-based method for identifying assay-specific gene signatures in FG studies.

Results: The proposed method (PM) is unique for several reasons. First, it is the only one, to our knowledge, that uses gene contribution, the product of a gene's loading and its expression level, to obtain assay signatures. Second, the PM develops and exploits two types of assay-specific contribution plots, which are new to the application of PCA in the FG area. The first type plots the assay-specific gene contribution against the given order of the genes; it reveals variations in distribution between assay-specific gene signatures, as well as outliers within assay groups that indicate the degree of importance of the most dominant genes. The second type plots the contribution of each gene, sorted in ascending or descending order, against a constantly increasing index. This type of plot reveals assay-specific gene signatures defined by the inflection points in the curve. In addition, sharp regions within the signature define the genes that contribute the most to the signature. We propose and use curvature as a metric to characterize these sharp regions, thus identifying the subset of genes contributing the most to the signature. Finally, the PM uses the full dataset to determine the final gene signature, thus eliminating the chance of gene exclusion by poor screening in earlier steps. The strengths of the PM are demonstrated using a simulation study and two studies of real DNA microarray data: a study of classification of human tissue samples and a study of E. coli cultures with different medium formulations.

Conclusion: We have developed a PCA-based method that effectively identifies assay-specific signatures in ranked groups of genes from the full dataset, using a simpler and more efficient procedure than current approaches. Although this work demonstrates the ability of the PM to identify assay-specific signatures in DNA microarray experiments, the approach could also be useful in areas such as proteomics and metabolomics.
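
The two quantities the abstract relies on, gene contribution and the curvature of the sorted contribution curve, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the data layout (assays as rows, genes as columns), the mean-centering step, the function names `gene_contributions` and `sorted_curve_curvature`, the synthetic data, and the use of the standard plane-curve curvature formula |y''| / (1 + y'^2)^(3/2) estimated by finite differences. The abstract states only that contribution is a product of loading and expression level and that curvature characterizes the sharp regions.

```python
import numpy as np

def gene_contributions(X, pc=0):
    """Per-assay gene contributions to one principal component.

    X is assumed to be an (assays x genes) expression matrix.
    contribution[i, j] = centered expression of gene j in assay i
                         times the loading of gene j on component `pc`.
    """
    Xc = X - X.mean(axis=0)                      # mean-center each gene (assumed preprocessing)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[pc]                            # one loading per gene
    return Xc * loadings                         # broadcasts over assays

def sorted_curve_curvature(contrib):
    """Sort one assay's contributions and estimate curvature along the curve.

    Returns the gene order, the sorted contributions, and a finite-difference
    estimate of the plane-curve curvature at each index (unit index spacing).
    """
    order = np.argsort(contrib)                  # ascending order of contribution
    y = contrib[order]
    dy = np.gradient(y)                          # first derivative w.r.t. index
    d2y = np.gradient(dy)                        # second derivative w.r.t. index
    kappa = np.abs(d2y) / (1.0 + dy ** 2) ** 1.5
    return order, y, kappa

# Synthetic example: 6 assays, 50 genes; genes 0-4 are raised in the first 3 assays.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
X[:3, :5] += 4.0

C = gene_contributions(X, pc=0)
order, y, kappa = sorted_curve_curvature(C[0])   # sorted contribution curve for assay 0
sharpest = order[np.argmax(kappa)]               # gene index at the sharpest bend
```

Under these assumptions, summing a row of `C` over genes recovers that assay's score on the chosen component, which is why the per-gene terms can be read as contributions. The curvature estimate depends on the scale of the contributions relative to the unit index spacing, so a real analysis would need to decide how to scale the axes before comparing bends.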
