Statistical Comparison Framework and Visualization Scheme for Ranking-Based Algorithms in High-Throughput Genome-Wide Studies

As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features, and therefore produce different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists or to compare the study-specific accuracy of the analytical methods against a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To meet the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary, customizable visualization tool. As a case study, we demonstrate the application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling-based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis for comparing the accuracy of various methods and for identifying redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
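
To make the idea of comparing ranked candidate lists concrete, the following is a minimal sketch in Python. It is not the framework proposed here, whose details are not given in this abstract. It assumes two standard measures as stand-ins: Kendall's rank correlation between two full rankings of the same candidates, and the fraction of shared candidates among the top k of two lists. The function names (`kendall_tau`, `top_k_overlap`) and the SNP identifiers are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def kendall_tau(list_a, list_b):
    """Kendall rank correlation between two rankings of the same items.

    Assumes no ties and that both lists contain exactly the same items.
    Returns a value in [-1, 1]: 1 for identical order, -1 for reversed order.
    """
    if len(list_a) != len(list_b) or set(list_a) != set(list_b):
        raise ValueError("both lists must rank the same items")
    if len(set(list_a)) != len(list_a):
        raise ValueError("rankings must not contain duplicates")
    if len(list_a) < 2:
        raise ValueError("need at least two items to compare rankings")

    # Position of every item in the second ranking.
    pos_b = {item: rank for rank, item in enumerate(list_b)}

    concordant = 0
    discordant = 0
    # combinations() yields pairs (x, y) with x ranked above y in list_a.
    for x, y in combinations(list_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def top_k_overlap(candidates, reference, k):
    """Fraction of the top-k reference candidates recovered in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(candidates[:k]) & set(reference[:k])) / k


# Hypothetical ranked lists (illustrative identifiers, not real data).
reference = ["snp_1", "snp_2", "snp_3", "snp_4", "snp_5"]
method_a = ["snp_1", "snp_3", "snp_2", "snp_4", "snp_5"]  # one swapped pair
method_b = list(reversed(reference))                       # fully reversed

tau_a = kendall_tau(reference, method_a)
tau_b = kendall_tau(reference, method_b)
overlap_a = top_k_overlap(method_a, reference, k=2)

print(f"tau(reference, method_a) = {tau_a:.2f}")
print(f"tau(reference, method_b) = {tau_b:.2f}")
print(f"top-2 overlap for method_a = {overlap_a:.2f}")
```

In this toy example, a method whose list agrees closely with the reference has a correlation near 1, while two methods whose lists are almost perfectly correlated with each other would be candidates for redundancy. Real candidate lists typically share only part of their members, which a full-ranking measure such as this one does not handle without further assumptions.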
