Sequence biases in large scale gene expression profiling data

We present the results of a simple statistical assay that measures the G+C content sensitivity bias of gene expression experiments without requiring a duplicate experiment. We analyse five gene expression profiling methods: Affymetrix GeneChip, Long Serial Analysis of Gene Expression (LongSAGE), LongSAGELite, ‘Classic’ Massively Parallel Signature Sequencing (MPSS) and ‘Signature’ MPSS. We demonstrate that the methods have systematic and random errors that lead to different G+C content sensitivities. The relationship between this experimental error and the G+C content of the probe set or tag that identifies each gene influences whether the gene is detected and, if detected, the level of gene expression measured. LongSAGE has the least bias, while Signature MPSS shows a strong bias towards G+C-rich tags, and Affymetrix data show different biases depending on the data processing method (MAS 5.0, RMA or GC-RMA). The bias in the Affymetrix data primarily affects genes expressed at lower levels. Despite the larger sampling of the MPSS library, SAGE identifies significantly more genes (60% more RefSeq genes in a single comparison).
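
The abstract does not spell out the assay, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes a list of tags expected from a reference set and a dictionary of observed tag counts, and reports the share of expected tags that were detected in each G+C content bin. The function names, the binning scheme and the input format are all hypothetical.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Return the fraction of G or C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def detection_rate_by_gc(expected_tags, observed_counts, n_bins=4):
    """Share of expected tags seen at least once, grouped by G+C bin.

    expected_tags: iterable of tag sequences (hypothetical input format)
    observed_counts: dict mapping tag sequence -> observed count
    n_bins: number of equal-width G+C bins (an arbitrary choice)
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for tag in expected_tags:
        # Map a G+C fraction in [0, 1] to a bin index in [0, n_bins - 1].
        bin_index = min(int(gc_fraction(tag) * n_bins), n_bins - 1)
        totals[bin_index] += 1
        if observed_counts.get(tag, 0) > 0:
            detected[bin_index] += 1
    return {b: detected[b] / totals[b] for b in sorted(totals)}


rates = detection_rate_by_gc(["AAAA", "GGGG", "GGCC"], {"GGGG": 3})
print(rates)
```

In an unbiased experiment one would expect roughly similar detection rates across bins; a rate that rises steadily with G+C content would be consistent with the kind of bias towards G+C-rich tags described above.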