The International Cancer Genome Consortium Data Portal

To the Editor — The International Cancer Genome Consortium (ICGC) is a global initiative to build a comprehensive catalog of mutational abnormalities in the major tumor types. Here we present the ICGC’s Data Portal, a user-friendly platform for efficient visualization, analysis and interpretation of large, diverse cancer datasets. The portal currently contains data from 84 worldwide cancer projects, collectively representing about 77 million somatic mutations and molecular data from over 20,000 contributors. We use scalable big-data technologies to overcome the challenges of storing, annotating and exploring large and complex datasets¹, thereby facilitating powerful integrative analyses that may shed new light on cancer biology. For example, the integration of large numbers of tumor genomes in the ICGC portal will enable the identification of rare molecular subtypes that have distinctive clinical behavior. Ultimately, this may lead to the development of new and better diagnostic tools, as well as more targeted therapies and drugs.

Advances in sequencing and molecular profiling technologies have rapidly accelerated the generation of cancer-related genetic, molecular and clinical data. Although the original ICGC portal system, based on a traditional SQL database², supported contributions to the ICGC project for the first three years, it was unable to support the growing data demands. To overcome these challenges, the ICGC Data Portal was developed, with not only highly efficient search algorithms for interactive querying and browsing but also intuitive and powerful user interfaces to help users interpret complex molecular and associated clinical data.

The ICGC Data Portal software ecosystem consists of several components. These include the data submission system, an extract–transform–load (ETL) pipeline, an optimized data model, the repository indexer, the data download system, and the ICGC Data Portal user interface.
These components are based on distributed analysis and index-based technologies, including Hadoop MapReduce, Hadoop Distributed File System (HDFS), Spark, MongoDB and Elasticsearch, which allow computation to be parallelized and distributed for improved speed and scalability³.

The data submission system supports 11 molecular data types (Supplementary Table 1), as well as clinical and biospecimen data for each donor. This Java web-based application allows ICGC’s members to transfer their submission files to the ICGC Data Coordination Center and validate them against the release data dictionary (Supplementary Fig. 1). Users can review detailed validation reports and sign off submissions when they are satisfied with their quality. We use the open source big-data frameworks Hadoop (http://hadoop.apache.org/) and Cascading (https://www.cascading.org/) (Supplementary Note 1) to support the increasingly large datasets and concurrent data submissions from more than 80 projects. Across 414 GB of data files entered through the ICGC submission system during release 27, over 9 billion individual data element validations were processed.

The ETL pipeline, a 13-step process performed on all the data in the portal (Supplementary Note 2), generates a new data release every four months. During this process, variants are annotated with bioinformatic tools such as SNPEFF⁴ and FATHMM⁵, and objects are linked with external resources such as the COSMIC Cancer Gene Census⁶, Reactome pathways⁷, Gene Ontology⁸,⁹, Ensembl (release 75)¹⁰, UniProtKB/Swiss-Prot¹¹, ZINC compound database¹², and clinical annotation from CIVIC¹³ and ClinVar¹⁴. The ETL was rebuilt on the Apache Spark platform (https://spark.apache.org/), which led to a notable reduction in processing time, from 5 days (release 16) to an average of 16 hours (release 27) with more than five times as much submitted data (70 GB and 414 GB, respectively).
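To put the reported ETL figures in perspective, the short calculation below combines the shorter run time with the larger input size. It uses only the numbers quoted above. The conversion of 5 days to 120 hours is an assumption, and the result is a rough illustration rather than a like-for-like benchmark, since the two releases may have run on different hardware and with different pipeline steps.

```python
# Figures quoted in the text; treating "5 days" as 5 x 24 hours is an assumption.
old_hours = 5 * 24   # release 16: about 5 days
new_hours = 16       # release 27: about 16 hours on average
old_gb = 70          # submitted data, release 16
new_gb = 414         # submitted data, release 27

speedup = old_hours / new_hours      # reduction in wall-clock time
data_growth = new_gb / old_gb        # growth in submitted data
# Effective throughput gain in GB processed per hour.
throughput_gain = (new_gb / new_hours) / (old_gb / old_hours)

print(f"wall-clock speed-up: {speedup:.1f}x")
print(f"data growth: {data_growth:.1f}x")
print(f"throughput gain: {throughput_gain:.0f}x")
```

Under these assumptions the run time falls by a factor of 7.5 while the input grows roughly 5.9-fold, which corresponds to an effective throughput gain of about 44-fold.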
At the end of the ETL, donors, genes and annotated somatic mutations are collated into indices, which enable efficient data searching through the portal.

The ICGC dataset comprises several high-level data types, including donor, gene, mutation and cancer drug. To support fast, simultaneous querying of these entities and their relationships, we developed a sophisticated search framework based on

Fig. 1 | The ICGC Data Portal Facet Search interface. Queryable variables are shown with field values and aggregate counts as facets on the left-hand panel. Selecting facets causes the table and summary graphics on the right-hand panel to update in real time. Tables contain hyperlinked lists of associated entity counts. Users can save query results as three distinct sets of donor, gene and mutation entities for further in-browser analyses and visualizations using ‘Save/Edit Gene Results’.
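The facet behavior described in the Fig. 1 caption can be illustrated with a minimal in-memory sketch. This is not the portal's implementation, which is index-based; the donor records, field names and the `facet_search` function are hypothetical. The sketch also assumes one common convention for faceted interfaces: the counts for a field are computed with that field's own selection lifted, so that unselected values remain visible.

```python
from collections import Counter

# Hypothetical donor records; field names and values are illustrative only.
DONORS = [
    {"id": "D1", "primary_site": "Liver", "gender": "male"},
    {"id": "D2", "primary_site": "Liver", "gender": "female"},
    {"id": "D3", "primary_site": "Breast", "gender": "female"},
    {"id": "D4", "primary_site": "Pancreas", "gender": "male"},
]


def matches(record, filters):
    """Return True if the record satisfies every selected facet (OR within a field)."""
    return all(record.get(field) in allowed for field, allowed in filters.items())


def facet_search(records, filters, facet_fields):
    """Return the matching records and per-field value counts for the facet panel."""
    hits = [r for r in records if matches(r, filters)]
    facets = {}
    for field in facet_fields:
        # Lift this field's own filter so its unselected values keep their counts.
        others = {f: v for f, v in filters.items() if f != field}
        facets[field] = Counter(r[field] for r in records if matches(r, others))
    return hits, facets


# Selecting the "Liver" facet narrows the table and updates the other facets.
hits, facets = facet_search(
    DONORS, {"primary_site": {"Liver"}}, ["primary_site", "gender"]
)
print([d["id"] for d in hits])
print(dict(facets["primary_site"]))
print(dict(facets["gender"]))
```

Selecting a value narrows the result table to the matching donors, while the counts for the other fields are recomputed over that narrowed set. A production system would delegate both steps to the search index rather than scanning records in memory.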