Big Data Analytics with Spark

Born from a graduate research project at UC Berkeley, Apache Spark has grown into one of the most widely used big data analytics platforms. While Spark integrates with the older Hadoop ecosystem, it provides more intuitive, faster, and more powerful abstractions for manipulating distributed data than MapReduce. In this workshop, we will cover the basics of Spark with the goal of getting participants up to speed so that they can use it themselves or teach it in courses that involve big data or distributed processing. Participants will work through examples that range from calculating basic summary statistics to using Spark's machine learning library (MLlib) for more sophisticated analyses of large datasets. Tasks during the session will be performed on smaller samples, with Spark running in local mode on participants' laptops. We will also discuss how Spark can be run on a local or cloud-based cluster and point participants toward resources for setting up those environments for their students.