anndata: Annotated data

anndata is a Python package for handling annotated data matrices in memory and on disk (github.com/theislab/anndata), positioned between pandas and xarray. It offers a broad range of computationally efficient features, including sparse data support, lazy operations, and a PyTorch interface.

Statement of need

Generating insight from high-dimensional data matrices typically involves training models that annotate observations and variables via low-dimensional representations. In exploratory data analysis, this means iterating between training and analysis, using both original and learned annotations as well as task-associated representations. anndata offers a canonical data structure for keeping track of these, a need addressed neither by pandas (McKinney, 2010), nor by xarray (Hoyer & Hamman, 2017), nor by commonly used modeling packages such as scikit-learn (Pedregosa et al., 2011).
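The book-keeping problem described above can be illustrated with a small standard-library sketch. It is not the anndata API: the class `ToyAnnotatedMatrix` and its methods `add_representation` and `subset_obs` are hypothetical names invented for this illustration. Only the attribute names `X`, `obs`, `var`, and `obsm` follow anndata's naming. The sketch assumes a dense list-of-lists matrix and shows how annotations and a learned low-dimensional representation stay aligned with the data matrix when observations are subset.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAnnotatedMatrix:
    """Toy annotated matrix (hypothetical illustration, not the anndata API)."""

    X: list                                    # n_obs x n_vars data matrix (list of rows)
    obs: dict = field(default_factory=dict)    # per-observation annotation columns
    var: dict = field(default_factory=dict)    # per-variable annotation columns
    obsm: dict = field(default_factory=dict)   # per-observation representations

    def __post_init__(self):
        n_obs, n_vars = self.shape
        if any(len(row) != n_vars for row in self.X):
            raise ValueError("all rows of X must have the same length")
        for name, column in self.obs.items():
            if len(column) != n_obs:
                raise ValueError(f"obs[{name!r}] must have {n_obs} entries")
        for name, column in self.var.items():
            if len(column) != n_vars:
                raise ValueError(f"var[{name!r}] must have {n_vars} entries")
        for name, rows in self.obsm.items():
            if len(rows) != n_obs:
                raise ValueError(f"obsm[{name!r}] must have {n_obs} rows")

    @property
    def shape(self):
        return len(self.X), (len(self.X[0]) if self.X else 0)

    def add_representation(self, name, rows):
        """Attach a learned representation with one row per observation."""
        if len(rows) != self.shape[0]:
            raise ValueError("representation must have one row per observation")
        self.obsm[name] = rows

    def subset_obs(self, keep):
        """Return a new object with only the observations where keep is True."""
        if len(keep) != self.shape[0]:
            raise ValueError("mask must have one entry per observation")
        pick = lambda rows: [r for r, k in zip(rows, keep) if k]
        return ToyAnnotatedMatrix(
            X=pick(self.X),
            obs={name: pick(col) for name, col in self.obs.items()},
            var={name: list(col) for name, col in self.var.items()},
            obsm={name: pick(rows) for name, rows in self.obsm.items()},
        )


# Three observations (e.g. cells) measured on three variables (e.g. genes).
data = ToyAnnotatedMatrix(
    X=[[1.0, 0.0, 3.0], [0.0, 2.0, 0.0], [4.0, 0.0, 0.0]],
    obs={"cell_type": ["B", "T", "B"]},
    var={"gene": ["g1", "g2", "g3"]},
)

# A learned 2-D representation is stored next to the original annotations.
data.add_representation("X_embed", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Subsetting by an annotation keeps X, obs, and obsm aligned; var is unchanged.
b_cells = data.subset_obs([ct == "B" for ct in data.obs["cell_type"]])
print(b_cells.shape)
print(b_cells.obsm["X_embed"])
```

With plain arrays or a single data frame, each of these pieces would have to be filtered separately and kept in sync by hand; a container that owns all of them makes the alignment a property of the data structure rather than of the analysis script.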

[1] Fabian J. Theis, et al. Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape, 2021, Genome Biology.

[2] O. Stegle, et al. MUON: multimodal omics analysis framework, 2021, bioRxiv.

[3] Aaron M. Streets, et al. scvi-tools: a library for deep probabilistic analysis of single-cell omics data, 2021, bioRxiv.

[4] Sidney M. Bell, et al. cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices, 2021, bioRxiv.

[5] Fabian J. Theis, et al. Squidpy: a scalable framework for spatial single cell analysis, 2021, bioRxiv.

[6] Raphael Gottardo, et al. Integrated analysis of multimodal single-cell data, 2020, Cell.

[7] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[8] Fabian J. Theis, et al. Generalizing RNA velocity to transient cell states through dynamical modeling, 2019, Nature Biotechnology.

[9] Raphael Gottardo, et al. Orchestrating single-cell analysis with Bioconductor, 2019, Nature Methods.

[10] Shila Ghazanfar, et al. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program, 2019, Nature.

[11] Leland McInnes, et al. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, 2018, ArXiv.

[12] Fabian J. Theis, et al. SCANPY: large-scale single-cell gene expression data analysis, 2018, Genome Biology.

[13] Benjamin Haibe-Kains, et al. Software for the integration of multi-omics experiments in Bioconductor, 2017, bioRxiv.

[14] Stephan Hoyer, et al. xarray: N-D labeled arrays and datasets in Python, 2017.

[15] Stavros Papadopoulos, et al. The TileDB Array Data Storage Manager, 2016, Proc. VLDB Endow.

[16] Raphael Gottardo, et al. Orchestrating high-throughput genomic analysis with Bioconductor, 2015, Nature Methods.

[17] Gaël Varoquaux, et al. Scikit-learn: Machine Learning in Python, 2011, J. Mach. Learn. Res.

[18] Wes McKinney, et al. Data Structures for Statistical Computing in Python, 2010, SciPy.

[19] Benjamin S. Baumer, et al. Tidy data, 2022, Modern Data Science with R.