Non-Authentication Based Checkpoint Fault-tolerant Vulnerability in Spark Streaming

Apache Spark uses Resilient Distributed Datasets (RDDs) as its primitive for data sharing. Keeping RDDs in memory makes Spark fast, but it also makes the data volatile: when a failure occurs or an RDD is lost, Spark must recompute every missing RDD in the lineage. A checkpoint truncates the lineage by saving the data required for subsequent computation, which makes it an essential fault-tolerance mechanism. In this paper, we find that Spark Streaming jobs that use checkpointing do not authenticate the user when checkpointing is performed during job execution. We present two typical attack scenarios in which attackers exploit this vulnerability to interfere with a legitimate user's job, causing data loss or even incorrect results. We then propose a solution that focuses on administering the permissions of the checkpoint directory. Experimental results show that our scheme can effectively monitor and resist this attack.
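
To make the idea of administering checkpoint directory permissions concrete, the following is a minimal sketch and not the scheme evaluated in this paper. It assumes the checkpoint directory is a local POSIX directory (a real deployment would more likely use HDFS or another shared store), and the helper names `checkpoint_dir_is_restricted` and `restrict_checkpoint_dir` are hypothetical. The check flags a directory that any account other than its owner can access, and the fix narrows it to owner-only access.

```python
import os
import stat
import tempfile


def checkpoint_dir_is_restricted(path: str) -> bool:
    """Return True if no group or other permission bits are set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any group/other bit means another account could read or tamper
    # with the checkpoint files stored in this directory.
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def restrict_checkpoint_dir(path: str) -> None:
    """Limit the directory to read/write/execute for its owner only (0o700)."""
    os.chmod(path, stat.S_IRWXU)


if __name__ == "__main__":
    # Hypothetical stand-in for a job's checkpoint directory.
    checkpoint_dir = tempfile.mkdtemp(prefix="checkpoint-")
    os.chmod(checkpoint_dir, 0o777)  # simulate an overly permissive setup

    if not checkpoint_dir_is_restricted(checkpoint_dir):
        restrict_checkpoint_dir(checkpoint_dir)

    print(oct(stat.S_IMODE(os.stat(checkpoint_dir).st_mode)))
    os.rmdir(checkpoint_dir)
```

A periodic check of this kind only covers the monitoring side; it does not by itself authenticate the process that writes the checkpoint.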