Uncovering hidden duplicated content in public transcriptomics data

As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data of different types and from different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we identified duplicated content in GEO and ArrayExpress, affecting ∼14% of our data: experiments fully or partially duplicated across independent data submissions, Affymetrix chips reused in several experiments, and chips reused within a single experiment. We present here the procedure that we have established to filter such duplicates from Affymetrix data, and our procedure to identify potential future duplicates in RNA-Seq data. Database URL: http://bgee.unil.ch/
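The abstract does not spell out how duplicates are detected, so the following is only a minimal sketch of one common approach, not the Bgee procedure itself: compute a checksum over the content of each chip and group chips whose checksums match. The function names (`content_digest`, `find_duplicates`), the chip identifiers, the `(probe_id, signal)` layout, and the choice of SHA-1 with four-decimal rounding are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def content_digest(values):
    """Return a SHA-1 digest of a chip's (probe_id, signal) pairs.

    Pairs are sorted first, so the digest does not depend on row order.
    The four-decimal rounding is an illustrative assumption.
    """
    h = hashlib.sha1()
    for probe_id, signal in sorted(values):
        h.update(f"{probe_id}\t{signal:.4f}\n".encode("utf-8"))
    return h.hexdigest()


def find_duplicates(chips):
    """Group chip identifiers that share an identical content digest."""
    groups = defaultdict(list)
    for chip_id, values in chips.items():
        groups[content_digest(values)].append(chip_id)
    # Keep only digests seen more than once, in a deterministic order.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Hypothetical toy data: chipB repeats chipA's content under another experiment.
chips = {
    "exp1/chipA": [("p1", 10.5), ("p2", 3.25)],
    "exp2/chipB": [("p2", 3.25), ("p1", 10.5)],
    "exp2/chipC": [("p1", 10.5), ("p2", 7.0)],
}
duplicates = find_duplicates(chips)
print(duplicates)  # [['exp1/chipA', 'exp2/chipB']]
```

Hashing content rather than comparing file names or accession numbers is what lets this kind of check catch a chip resubmitted under a different identifier. An exact-match digest will miss data that were renormalized or otherwise altered, so such cases would need a separate comparison.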
