The Berlin Big Data Center (BBDC)

Abstract The last decade has been characterized by the collection and availability of unprecedented amounts of data, driven by rapidly decreasing storage costs and the omnipresence of sensors and data-producing global online services. To process and analyze this data deluge, novel distributed data processing systems based on the dataflow paradigm, such as Apache Hadoop, Apache Spark, and Apache Flink, were built and have been scaled to tens of thousands of machines. However, writing efficient implementations of data analysis programs on these systems requires a deep understanding of systems programming, which prevents large groups of data scientists and analysts from using this technology effectively. In this article, we present some of the main achievements of the research carried out by the Berlin Big Data Center (BBDC). We introduce the two domain-specific languages Emma and LARA, which are deeply embedded in Scala and enable the declarative specification and automatic parallelization of data analysis programs; the PEEL framework for transparent and reproducible benchmark experiments on distributed data processing systems; and approaches to improve the interpretability of machine learning models. Finally, we provide an overview of the challenges to be addressed in the second phase of the BBDC.
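To give a flavor of the declarative style the abstract describes, the sketch below expresses a word count as an ordinary Scala for-comprehension over collections. This is not the actual Emma or LARA API; it is a hypothetical stand-in that uses plain Scala `Seq`s where an embedded DSL would use its distributed collection type, illustrating how such a program can be written without any systems-level parallelization code.

```scala
// Word count written declaratively over plain Scala collections.
// In an Emma-style embedded DSL, a distributed collection type would take
// the place of Seq, and the compiler would parallelize the comprehension.
object WordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    (for {
      line <- lines                       // iterate over input lines
      word <- line.split("\\s+").toSeq    // split each line into words
      if word.nonEmpty                    // drop empty tokens
    } yield word)
      .groupBy(identity)                  // group equal words together
      .map { case (w, ws) => (w, ws.size) } // count occurrences per word
}
```

Because the program is expressed as a comprehension rather than explicit map/reduce calls, a DSL compiler is free to choose the physical execution strategy (e.g., a distributed aggregation) without the author changing the code.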
