Towards an Ontology-Based Semantic Approach to Tuning Parameters to Improve Hadoop Application Performance

Hadoop MapReduce helps companies and researchers process large volumes of data. Hadoop has many configuration parameters that must be tuned to obtain better application performance. However, inexperienced users do not easily find the best parameter settings. It is therefore necessary to create environments that promote and motivate information sharing and knowledge dissemination. In addition, all acquired knowledge should be organized so that it can be reused quickly, easily, and efficiently whenever necessary. This paper proposes an ontology-based semantic approach to tuning parameters to improve Hadoop application performance. The approach integrates techniques from machine learning, semantic search, and ontologies.
