Paper Information - Using Performance Measurements to Improve MapReduce Algorithms

Abstract: The Hadoop MapReduce software environment is used for parallel processing of distributively stored data. Data mining algorithms of increasing sophistication are being implemented in MapReduce, bringing new challenges for performance measurement and tuning. We focus on analyzing a job after completion, utilizing information collected from Hadoop logs and machine metrics. Our analysis, inspired by [1] [2], goes beyond conventional Hadoop Job-Tracker analysis by integrating more data and providing web browser visualization tools. This paper describes examples where measurements helped diagnose subtle issues and improve algorithm performance. Examples demonstrate the value of correlating detailed information that is not usually examined in standard Hadoop performance displays.

Todd Plantenga | Yung Ryn Choe | Ann Yoshimura

[1] Tom White, et al. Hadoop: The Definitive Guide, 2009.

[2] Ana Paula Appel, et al. Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations, 2010, SDM.

[3] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[4] Andy Konwinski, et al. Chukwa: A large-scale monitoring system, 2008.

[5] Rajeev Gandhi, et al. An Analysis of Traces from a Production MapReduce Cluster, 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[6] David E. Culler, et al. The ganglia distributed monitoring system: design, implementation, and experience, 2004, Parallel Comput.

[7] Christos Faloutsos, et al. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations, 2009, 2009 Ninth IEEE International Conference on Data Mining.