We present a new Python library, called vaex, to handle extremely large tabular datasets, such as astronomical catalogues like the Gaia catalogue, N-body simulations, or any other regular datasets which can be structured in rows and columns. Fast computation of statistics on regular N-dimensional grids allows analysis and visualization on the order of a billion rows per second. We use streaming algorithms, memory-mapped files and a zero-memory-copy policy to allow exploration of datasets larger than memory, i.e. out-of-core algorithms. Vaex allows arbitrary (mathematical) transformations using normal Python expressions and (a subset of) numpy functions, which are lazily evaluated and computed when needed in small chunks, which avoids wasting RAM. Boolean expressions (which are also lazily evaluated) can be used to explore subsets of the data, which we call selections. Vaex uses a DataFrame API similar to that of Pandas, a very popular library, which eases migration from Pandas. Visualization is one of the key points of vaex, and is done using binned statistics in 1d (e.g. histograms), in 2d (e.g. 2d histograms with colormapping) and in 3d (using volume rendering). Vaex is split into several packages: vaex-core for the computational part, vaex-viz for visualization mostly based on matplotlib, vaex-jupyter for visualization in the Jupyter notebook/lab based on ipywidgets, vaex-server for the (optional) client-server communication, vaex-ui for the Qt-based interface, vaex-hdf5 for HDF5-based memory-mapped storage, and vaex-astro for astronomy-related selections, transformations and memory-mapped (column-based) FITS storage. Vaex is open source and available under the MIT license on GitHub. Documentation and other information can be found on the main website: this https URL, this https URL or this https URL
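The combination of chunked (streaming) evaluation, lazily evaluated expressions, selections and binned statistics on a regular grid can be illustrated with a minimal sketch. It uses only numpy and is not the vaex API or its implementation: the function `binned_count_2d`, its parameters, and the dictionary-of-columns layout are hypothetical names chosen for illustration. Expressions and the selection are passed as callables, so they are only evaluated on one chunk at a time, and the columns could equally be `np.memmap` arrays backed by a file larger than memory.

```python
import numpy as np


def binned_count_2d(columns, x_expr, y_expr, limits, shape=(64, 64),
                    selection=None, chunk_size=100_000):
    """Count rows on a regular 2d grid, streaming over the data in chunks.

    Hypothetical helper for illustration only (not part of vaex).
    columns   : dict mapping column name to a 1d array (may be memory mapped)
    x_expr    : callable(chunk) -> array, evaluated lazily per chunk
    y_expr    : callable(chunk) -> array, evaluated lazily per chunk
    limits    : [(xmin, xmax), (ymin, ymax)] extent of the grid
    selection : optional callable(chunk) -> boolean mask
    """
    n_rows = len(next(iter(columns.values())))
    grid = np.zeros(shape, dtype=np.int64)
    for start in range(0, n_rows, chunk_size):
        # Slicing yields views, so no column data is copied here.
        chunk = {name: col[start:start + chunk_size]
                 for name, col in columns.items()}
        # Expressions are only materialised for the current chunk.
        x = x_expr(chunk)
        y = y_expr(chunk)
        if selection is not None:
            mask = selection(chunk)
            x, y = x[mask], y[mask]
        # Fixed limits give identical bin edges for every chunk,
        # so per-chunk counts can simply be summed.
        counts, _, _ = np.histogram2d(x, y, bins=shape, range=limits)
        grid += counts.astype(np.int64)
    return grid


# Example data: two in-memory columns standing in for a large catalogue.
rng = np.random.default_rng(42)
columns = {"x": rng.normal(size=250_000), "y": rng.normal(size=250_000)}

grid = binned_count_2d(
    columns,
    x_expr=lambda c: c["x"],
    y_expr=lambda c: np.sqrt(c["x"] ** 2 + c["y"] ** 2),  # derived quantity
    limits=[(-4, 4), (0, 4)],
    selection=lambda c: c["y"] > 0,  # boolean expression acting as a selection
)
```

The resulting `grid` is a small array whose size is independent of the number of rows, which is what makes it suitable as input for a colormapped 2d histogram. Peak memory use is set by `chunk_size` rather than by the dataset size.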