SparkAC: Fine-Grained Access Control in Spark for Secure Data Sharing and Analytics

With the development of computing and communication technologies, extremely large amounts of data are being collected, stored, used, and shared, which raises new security and privacy challenges. The access control mechanisms that existing big data platforms provide are limited in granularity and expressiveness. In this article, we present SparkAC, a novel access control mechanism for secure data sharing and analysis in Spark. In particular, we first propose a purpose-aware access control (PAAC) model, which introduces the new concepts of data processing purpose and data operation purpose, together with an automatic purpose analysis algorithm that identifies purposes from data analytics operations and queries. Moreover, we develop a unified access control mechanism that implements the PAAC model in two modules: GuardSpark++ supports structured data access control in Spark Catalyst, and GuardDAG supports unstructured data access control in Spark Core. Finally, we evaluate GuardSpark++ and GuardDAG with multiple data sources, applications, and data analytics engines. Experimental results show that SparkAC provides effective access control functionality with very small (GuardSpark++) or medium (GuardDAG) performance overhead.
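The abstract does not spell out the PAAC model or the purpose analysis algorithm, so the following is only a minimal sketch of the general idea of purpose-aware checking, not the SparkAC implementation. Every name in it (`Operation`, `PURPOSE_OF`, `analyze_purposes`, `is_permitted`), the purpose labels, and the mapping from operation kinds to purposes are assumptions made for illustration. The sketch derives a set of purposes per column from a query's operations and permits the query only when each derived purpose has been granted for that column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One step of a query, reduced to its kind and the column it touches."""
    kind: str    # hypothetical kinds: "project", "filter", "aggregate"
    column: str


# Assumed mapping from operation kind to a purpose label.
# The paper's actual purpose taxonomy is not given in the abstract.
PURPOSE_OF = {
    "project": "output",
    "filter": "condition",
    "aggregate": "statistics",
}


def analyze_purposes(ops):
    """Derive, per column, the set of purposes implied by the operations."""
    purposes = {}
    for op in ops:
        purposes.setdefault(op.column, set()).add(PURPOSE_OF[op.kind])
    return purposes


def is_permitted(policy, ops):
    """Permit the query only if every derived purpose is granted for its column."""
    derived = analyze_purposes(ops)
    return all(
        needed <= policy.get(column, set())
        for column, needed in derived.items()
    )


# Hypothetical policy: "age" may be aggregated or filtered on, "name" may not be used.
policy = {"age": {"statistics", "condition"}, "name": set()}

avg_age = [Operation("aggregate", "age")]
dump_names = [Operation("project", "name")]

print(is_permitted(policy, avg_age))     # True
print(is_permitted(policy, dump_names))  # False
```

In this toy policy, computing a statistic over `age` is allowed while outputting raw `name` values is denied, which is the kind of distinction that column-level permissions alone cannot express.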
