PyBDA: a command line tool for automated analysis of big biological data sets

BackgroundAnalysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to lack of accessible tools that scale to hundreds of millions of data points.ResultsWe developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells.ConclusionPyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used with simple command line calls entirely making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.

[1]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[2]  Fabian J Theis,et al.  SCANPY: large-scale single-cell gene expression data analysis , 2018, Genome Biology.

[3]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[4]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[5]  Sahil R. Kalra,et al.  Big Challenges? Big Data … , 2015 .

[6]  Nick Golding,et al.  greta: simple and scalable statistical modelling in R , 2019, J. Open Source Softw..

[7]  Sven Rahmann,et al.  Genome analysis , 2022 .

[8]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[9]  Bernd Bischl,et al.  mlr: Machine Learning in R , 2016, J. Mach. Learn. Res..

[10]  Francisco Herrera,et al.  A comparison on scalability for batch big data processing on Apache Spark and Apache Flink , 2017 .

[11]  Shaoliang Peng,et al.  Bioinformatics applications on Apache Spark , 2018, GigaScience.

[12]  S. Geer,et al.  Statistics for big data: A perspective , 2018 .

[13]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[14]  Reynold Xin,et al.  Apache Spark , 2016 .

[15]  John Salvatier,et al.  Probabilistic programming in Python using PyMC3 , 2016, PeerJ Comput. Sci..

[16]  Dustin Tran,et al.  Edward: A library for probabilistic modeling, inference, and criticism , 2016, ArXiv.

[17]  V. Marx Biology: The big challenges of big data , 2013, Nature.

[18]  Avita Katal,et al.  Big data: Issues, challenges, tools and Good practices , 2013, 2013 Sixth International Conference on Contemporary Computing (IC3).

[19]  Alexis Boukouvalas,et al.  GPflow: A Gaussian Process Library using TensorFlow , 2016, J. Mach. Learn. Res..