SGX-PySpark: Secure Distributed Data Analytics

Data analytics is central to modern online services, particularly those data-driven. Often this entails the processing of large-scale datasets which may contain private, personal and sensitive information relating to individuals and organisations. Particular challenges arise where cloud is used to store and process the sensitive data. In such settings, security and privacy concerns become paramount, as the cloud provider is trusted to guarantee the security of the services they offer, including data confidentiality. Therefore, the issue this work tackles is “How to securely perform data analytics in a public cloud?” To assist this question, we design and implement SGX-PySpark- a secure distributed data analytics system which relies on a trusted execution environment (TEE) such as Intel SGX to provide strong security guarantees. To build SGX-PySpark, we integrate PySpark - a widely used framework for data analytics in industry to support a wide range of queries, with SCONE - a shielded execution framework using Intel SGX.