MMTF—An efficient file format for the transmission, visualization, and analysis of macromolecular structures

Recent advances in experimental techniques have led to rapid growth in the complexity, size, and number of macromolecular structures made available through the Protein Data Bank. This creates a challenge for macromolecular visualization and analysis. Macromolecular structure files, such as PDB or PDBx/mmCIF files, can be slow to transfer and parse, and hard to incorporate into third-party software tools. Here, we present a new binary and compressed data representation, the MacroMolecular Transmission Format (MMTF), along with software implementations in several languages developed around it, which address these issues. We describe the new format and its APIs and demonstrate that it is several times faster to parse than, and about a quarter of the file size of, the current standard format, PDBx/mmCIF. As a consequence of the new data representation, it is now possible to visualize structures with millions of atoms in a web browser, and to keep the whole PDB archive in memory or parse it within a few minutes on average computers, which opens up new ways of thinking about how to design and implement efficient algorithms in structural bioinformatics. The PDB archive is available in MMTF file format through web services and data that are updated on a weekly basis.
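
The abstract does not spell out how the binary, compressed representation is built, so the following is only a minimal sketch of one common strategy for coordinate data: fixed-point integer encoding followed by delta encoding and binary packing. The function names (`encode_coords`, `decode_coords`, `pack_int32`) and the `precision` parameter are hypothetical and chosen for illustration; they are not taken from the MMTF specification or its APIs.

```python
import struct


def encode_coords(values, precision=1000):
    """Convert floats to fixed-point integers, then store successive differences.

    `precision` is an assumed scaling factor (1000 keeps three decimal places).
    """
    ints = [round(v * precision) for v in values]
    if not ints:
        return []
    # The first value is stored as-is; every later value as a difference from its predecessor.
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def decode_coords(deltas, precision=1000):
    """Invert encode_coords: take a running sum, then rescale to floats."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total / precision)
    return out


def pack_int32(ints):
    """Pack integers as big-endian 32-bit values (4 bytes each)."""
    return struct.pack(f">{len(ints)}i", *ints)


if __name__ == "__main__":
    xs = [10.123, 10.456, 11.001, 9.87]
    deltas = encode_coords(xs)
    packed = pack_int32(deltas)
    print(deltas, len(packed), decode_coords(deltas))
```

Because neighboring atoms tend to lie close together, the differences are usually small integers, which a later stage (for example narrower integer types or a general-purpose compressor) can store more compactly than the decimal text used in PDBx/mmCIF. This sketch stops at fixed-width packing and makes no claim about the actual on-disk layout of MMTF.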
