A COMPREHENSIVE QUALITY ASSESSMENT OF THE AFFYMETRIX U133A&B PROBESETS BY AN INTEGRATIVE GENOMIC AND CLINICAL DATA ANALYSIS APPROACH

SUMMARY
Motivation: Insufficient reliability of expression measurements is a key problem facing microarray experiments. The problem may originate from poor gene identification by the probe sequences, whose design may not account for the actual complexity of the human genome.
Results: We re-estimated the genome localization of the Affymetrix U133A and U133B GeneChip (initial) target sequences and matched these sequences to genes and transcripts in the human genome. This resulted in a significant redefinition of the specificity and uniqueness of more than 2500 GeneChip probesets. Among the remaining target sequences, approximately one quarter overlapped with interspersed repeats, which could cause cross-hybridization signals and errors in expression measurements. To test this hypothesis, we compared GeneChip microarray data from large groups of breast cancer patients who differed in the aggressiveness of tumor growth. In particular, for tumors of low and high aggressiveness, we demonstrated that, among the set of differentially expressed genes, probesets with repeat-overlapped target sequences were statistically significantly underrepresented compared with probesets with repeat-free target sequences. In addition, 407 Affymetrix target sequences were incorrectly oriented relative to the genes they purportedly represented (anti-sense transcripts). Surprisingly, a large fraction of these "erroneous" sequences can be significantly associated with important regulatory biological processes, molecular functions and pathways. All defined categories of probe sequences have been annotated in our local Affy Probes Mapping and Annotation (APMA) database. Our results allow us to re-identify many targets used in a microarray experiment and to carry out biological classification of the anti-sense transcripts.
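
The underrepresentation claim above can be illustrated with a one-sided hypergeometric test. This is only a minimal sketch: the summary does not state which statistical test was used, so the choice of test is an assumption, and the function name `hypergeom_lower_tail` and all counts in the usage lines are hypothetical values chosen for illustration, not results from the study.

```python
from math import comb


def hypergeom_lower_tail(k, N, K, n):
    """Return P(X <= k) for X ~ Hypergeometric(N, K, n).

    N: total number of probesets considered
    K: probesets whose target sequence overlaps a repeat
    n: probesets called differentially expressed
    k: differentially expressed probesets that overlap a repeat
    A small value suggests repeat-overlapped probesets are
    underrepresented among the differentially expressed ones.
    """
    lowest = max(0, n - (N - K))   # smallest feasible overlap count
    highest = min(k, K, n)         # largest overlap count included in the tail
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(lowest, highest + 1)
    )
    return favourable / comb(N, n)


# Hypothetical counts (NOT from the paper): 1000 probesets, one quarter
# repeat-overlapped, 100 differentially expressed, 10 of them repeat-overlapped.
p_value = hypergeom_lower_tail(k=10, N=1000, K=250, n=100)
print(f"one-sided p-value for underrepresentation: {p_value:.3g}")
```

Under the null hypothesis that repeat overlap is unrelated to differential expression, about one quarter of the differentially expressed probesets would be expected to overlap repeats; the lower tail measures how surprising a smaller observed count is.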