The Cancer Genomics Hub (CGHub): overcoming cancer through the power of torrential data

The Cancer Genomics Hub (CGHub) is the online repository of the sequencing programs of the National Cancer Institute (NCI), including The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) projects, with data from 25 different types of cancer. The CGHub currently contains >1.4 PB of data, has grown at an average rate of 50 TB a month and serves >100 TB per week. The architecture of CGHub is designed to support bulk searching and downloading through a Web-accessible application programming interface, to enforce patient genome confidentiality in data storage and transmission, and to optimize for efficiency in access and transfer. In this article, we describe the design of these three components, present performance results for our transfer protocol, GeneTorrent, and finally report on the growth of the system in terms of data stored and transferred, including estimated limits on the current architecture. Our experience-based estimates suggest that centralizing storage and computational resources is more efficient than wide distribution across many satellite labs.

Database URL: https://cghub.ucsc.edu
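To give a sense of scale for the figures quoted above, the following is a minimal back-of-the-envelope sketch, not part of the original article. It assumes decimal units (1 TB = 10^12 bytes) and constant linear growth; the function names and the 2 PB target are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumption: decimal units (1 TB = 10**12 bytes).
TB = 10**12
SECONDS_PER_WEEK = 7 * 24 * 3600


def sustained_gbit_per_s(bytes_per_week: float) -> float:
    """Average line rate (Gbit/s) needed to move this volume in one week."""
    return bytes_per_week * 8 / SECONDS_PER_WEEK / 1e9


def months_to_reach(current_bytes: int, target_bytes: int,
                    growth_bytes_per_month: int) -> float:
    """Months until the target size, assuming constant linear growth."""
    return max(0.0, (target_bytes - current_bytes) / growth_bytes_per_month)


if __name__ == "__main__":
    # Serving 100 TB per week corresponds to this average outbound rate.
    rate = sustained_gbit_per_s(100 * TB)
    print(f"average serving rate: {rate:.2f} Gbit/s")

    # Hypothetical target of 2 PB, starting from 1.4 PB at 50 TB per month.
    months = months_to_reach(1400 * TB, 2000 * TB, 50 * TB)
    print(f"months until 2 PB at 50 TB/month: {months:.0f}")
```

Under these assumptions, serving 100 TB per week amounts to a sustained average of roughly 1.3 Gbit/s. Peak demand would be higher than this average, which is one reason transfer efficiency matters for a repository of this size.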
