The GCTx format and cmap{Py, R, M, J} packages: resources for optimized storage and integrated traversal of annotated dense matrices

Motivation: Facilitated by technological improvements, pharmacologic and genetic perturbational datasets have grown in recent years to include millions of experiments. Sharing and publicly distributing these diverse data creates many opportunities for discovery, but the unprecedented size of the data and the complexity of their associated metadata have also created storage and integration challenges.

Results: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization and analysis of annotated dense two-dimensional matrices. We have used the format extensively in the Connectivity Map to assemble and share massive datasets, currently comprising 1.3 million experiments. We anticipate that the format's generalizability, paired with the code libraries we provide, will lower barriers to integrated cross-assay analysis and algorithm development.

Availability and implementation: Software packages (in Python, R, Matlab and Java) are freely available at https://github.com/cmap. Additional instructions, tutorials and datasets are available at clue.io/code.

Supplementary information: Supplementary data are available at Bioinformatics online.
