Novel Statistical Tools for Management of Public Databases Facilitate Community‐Wide Replicability and Control of False Discovery

Issues of publication bias, lack of replicability, and false discovery have long plagued the genetics community. Proper utilization of public and shared data resources presents an opportunity to ameliorate these problems. We present an approach to public database management that we term Quality Preserving Database (QPD). It enables perpetual use of the database for testing statistical hypotheses, controlling false discovery and avoiding publication bias while also maintaining testing power. We demonstrate it on a use case of a replication server for GWAS findings, underlining its practical utility. We argue that a shift to using QPD in managing current and future biological databases will significantly enhance the community's ability to make efficient and statistically sound use of the available data resources.
