Big data is typically processed repeatedly with only small changes between runs, and this is a major pattern in big data processing. This incremental nature means that an incremental computing model can greatly improve performance. HDFS is the distributed file system of Hadoop, the most popular platform for big data analytics. However, HDFS adopts a fixed-size chunking policy, which is inefficient for incremental computing. In this paper, we propose iHDFS (incremental HDFS), a distributed file system that provides the underlying support for incremental big data processing. iHDFS is implemented as an extension to HDFS. In iHDFS, the Rabin fingerprint algorithm is applied to achieve content-defined chunking. This policy makes data chunking much more stable, so intermediate processing results can be reused efficiently and the performance of incremental data processing improves significantly. Experimental results demonstrate the effectiveness and efficiency of iHDFS.
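The core mechanism described above is content-defined chunking: chunk boundaries are derived from a rolling fingerprint of the data itself rather than from fixed offsets, so a small edit only shifts boundaries locally and unchanged chunks keep their identity, which is what allows cached intermediate results to be reused. The following Java sketch is a rough illustration of this idea, not the iHDFS implementation; the window size, boundary mask, and chunk-size bounds are assumed values chosen for readability.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of content-defined chunking with a rolling
 * polynomial (Rabin-style) fingerprint over a fixed window.
 * WINDOW, PRIME, MASK and the chunk-size bounds are illustrative
 * assumptions, not parameters taken from the iHDFS paper.
 */
public class ContentDefinedChunker {

    private static final int WINDOW = 48;              // bytes covered by the rolling fingerprint
    private static final long PRIME = 1_000_000_007L;  // hash multiplier (illustrative)
    private static final long MASK = (1L << 13) - 1;   // ~8 KiB expected chunk size
    private static final int MIN_CHUNK = 2 * 1024;     // lower bound on chunk size
    private static final int MAX_CHUNK = 64 * 1024;    // upper bound on chunk size

    /** Returns the end offsets (exclusive) of each chunk in the input. */
    public static List<Integer> chunkBoundaries(byte[] data) {
        // Precompute PRIME^WINDOW so the byte leaving the window can be removed in O(1).
        long primePowWindow = 1;
        for (int i = 0; i < WINDOW; i++) {
            primePowWindow *= PRIME;                    // wraps mod 2^64, which is acceptable here
        }

        List<Integer> boundaries = new ArrayList<>();
        long hash = 0;
        int chunkStart = 0;

        for (int i = 0; i < data.length; i++) {
            // Slide the window: add the new byte, drop the byte that falls out of the window.
            hash = hash * PRIME + (data[i] & 0xFF);
            if (i >= WINDOW) {
                hash -= primePowWindow * (data[i - WINDOW] & 0xFF);
            }

            int chunkLen = i - chunkStart + 1;
            if (chunkLen < MIN_CHUNK) {
                continue;                               // never emit tiny chunks
            }

            // Declare a boundary where the fingerprint matches the mask,
            // or force one when the chunk grows too large.
            if ((hash & MASK) == 0 || chunkLen >= MAX_CHUNK) {
                boundaries.add(i + 1);
                chunkStart = i + 1;
            }
        }
        if (chunkStart < data.length) {
            boundaries.add(data.length);                // final partial chunk
        }
        return boundaries;
    }
}
```

Because the fingerprint depends only on the most recent WINDOW bytes, an insertion or deletion early in a file perturbs boundaries only in its neighborhood; downstream chunks realign to the same cut points, which is the stability property the abstract relies on for reusing intermediate results.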