MapReduce Analysis for Cloud-Archived Data

Public storage clouds have become a popular choice for archiving certain classes of enterprise data, such as application and infrastructure logs. These logs contain sensitive information such as IP addresses and user logins, so regulatory and security requirements often mandate that the data be encrypted before it is moved to the cloud. To extract business value from such data, analytics systems (e.g., Hadoop/MapReduce) must first download it from the public cloud, decrypt it, and then process it at the secure enterprise site. We propose VNCache, an efficient solution for MapReduce analysis of such cloud-archived log data that does not require an a priori data transfer and loading into the local Hadoop cluster. VNCache dynamically integrates cloud-archived data into a virtual namespace at the enterprise Hadoop cluster. Through a seamless data streaming and prefetching model, Hadoop jobs can begin execution as soon as they are launched, without any a priori downloading. With VNCache's accurate prefetching and caching, jobs often run on a locally cached copy of the data block, which significantly improves performance. When no longer needed, data is safely evicted from the enterprise cluster, reducing the total storage footprint. Uniquely, VNCache is implemented with no changes to the Hadoop application stack.
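The streaming, prefetching, and eviction behavior described above can be illustrated with a toy sketch. This is not VNCache's implementation: the class name `PrefetchingBlockCache`, the `fetch_remote` callback (standing in for "download and decrypt one block"), the sequential prefetch policy, and the LRU eviction policy are all assumptions made for illustration, and prefetching here runs synchronously, whereas a real system would presumably overlap it with job execution.

```python
from collections import OrderedDict
from typing import Callable


class PrefetchingBlockCache:
    """Toy read-through block cache with sequential prefetch and LRU eviction.

    Hypothetical illustration only; names and policies are assumptions.
    """

    def __init__(self, fetch_remote: Callable[[int], bytes],
                 capacity: int = 4, prefetch_depth: int = 2) -> None:
        self.fetch_remote = fetch_remote      # assumed: download + decrypt a block
        self.capacity = capacity              # max blocks kept locally
        self.prefetch_depth = prefetch_depth  # blocks fetched ahead of a read
        self.blocks: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _store(self, block_id: int, data: bytes) -> None:
        # Insert as most recently used, then evict least recently used blocks.
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def read(self, block_id: int, total_blocks: int) -> bytes:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)
            data = self.blocks[block_id]
        else:
            # Cache miss: stream the block from the archive on demand.
            self.misses += 1
            data = self.fetch_remote(block_id)
            self._store(block_id, data)
        # Prefetch the next few blocks, assuming a sequential scan.
        last = min(block_id + 1 + self.prefetch_depth, total_blocks)
        for nxt in range(block_id + 1, last):
            if nxt not in self.blocks:
                self._store(nxt, self.fetch_remote(nxt))
        return data
```

Under these assumptions, a sequential scan misses only on its first block, and the local footprint never exceeds `capacity` blocks, which mirrors the abstract's claims about jobs starting immediately and running mostly on cached data.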
