THE BERKELEY DATA ANALYSIS SYSTEM (BDAS): AN OPEN SOURCE PLATFORM FOR BIG DATA ANALYTICS

Abstract: The goal of this proposal was to deliver a modular open-source software stack that can support a new generation of large-scale analytic tools that provide answers over arbitrarily large datasets. This work was carried out by Berkeley's AMPLab, a research lab consisting of eleven faculty members and over 40 students. In addition to this grant, AMPLab (which ended in December 2016) was supported by industry affiliates and an NSF Expeditions grant. This grant was instrumental in improving our software stack, the Berkeley Data Analysis System (BDAS), so that it can serve as a platform for the broader community. In particular, this grant enabled us to implement significant portions of the code bases, integrate BDAS with commonly used tools, and make BDAS much easier to manage. It also allowed us to extend the functionality of BDAS in several key areas, including streaming and query processing. Thanks to xData, BDAS has enjoyed great success in both academia and industry. Today, Apache Spark is used in production by thousands of companies and counts over 400K meetup members worldwide, while Apache Mesos and Alluxio (formerly known as Tachyon) are used by hundreds of companies around the world.
