Characterization of Hadoop Jobs Using Unsupervised Learning

The MapReduce programming paradigm and its open-source implementation, Apache Hadoop, are increasingly used for data-intensive applications in cloud computing environments. Understanding the characteristics of workloads running in MapReduce environments benefits both cloud service providers and their users. This work characterizes Hadoop jobs running on production clusters at Yahoo! using unsupervised learning. Unsupervised clustering techniques have been applied to many important problems, ranging from social network analysis to biomedical research. We use these techniques to group Hadoop MapReduce jobs that are similar in characteristics. The Hadoop framework generates metrics for every MapReduce job, such as the number of map and reduce tasks and the number of bytes read from and written to the local file system and HDFS. We combine these metrics with job configuration features, such as the format of the input/output files and the type of compression used, to measure similarity among Hadoop jobs. We study the centroids and densities of the resulting job clusters. We also compare the real production workload with the workload emulated by our benchmark tool, GridMix, by comparing the job clusters of both workloads.
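As a rough illustration of the approach described above, the following is a minimal sketch of clustering per-job metrics with k-means. The specific feature names, the example values, and the choice of k-means and of k are illustrative assumptions made here, not the paper's exact feature set or algorithm.

```python
# Sketch: cluster Hadoop job metrics with k-means (illustrative assumptions only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical per-job metrics, one row per MapReduce job:
# [num_map_tasks, num_reduce_tasks, hdfs_bytes_read, hdfs_bytes_written, local_bytes_read]
jobs = np.array([
    [120,   10, 5.2e9,  1.1e9,  3.0e8],
    [2400,  80, 9.8e10, 2.4e10, 6.5e9],
    [15,     1, 2.0e8,  5.0e7,  1.2e7],
    [130,   12, 6.0e9,  1.3e9,  3.5e8],
])

# Standardize so that large-magnitude byte counters do not dominate the distance metric.
features = StandardScaler().fit_transform(jobs)

# Group similar jobs, then inspect centroids and cluster sizes (densities).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print("labels:", kmeans.labels_)
print("centroids (standardized):", kmeans.cluster_centers_)
print("cluster sizes:", np.bincount(kmeans.labels_))
```

Categorical job configuration features such as input/output file format or compression type would additionally need to be encoded (for example, one-hot encoded) before being included in the distance computation.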