A technique for parallel query optimization using MapReduce framework and a semantic-based clustering method

Query optimization is the process of identifying the best Query Execution Plan (QEP). For a given query, the query optimizer aims to produce a close-to-optimal QEP, that is, one with minimal resource usage. The difficulty is that a given query has many equivalent execution plans, each with its own execution cost, so producing an effective plan requires examining a large number of alternatives. Access plan recommendation is an alternative approach to database query optimization that reuses previously generated QEPs to execute new queries. In this approach, the query optimizer uses clustering methods to identify groups of similar queries. However, clustering large query datasets is challenging for traditional clustering algorithms because of the long processing time involved. Numerous cloud-based platforms, such as Hadoop, Hive, and Pig, offer low-cost solutions for processing distributed queries. This paper applies and tests a model that uses MapReduce to cluster large query datasets of varying sizes in parallel. The results demonstrate that the parallel implementation of query workload clustering is effective and achieves good scalability.
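
The idea of grouping similar queries with a map and a reduce step can be illustrated with a small single-process sketch. It is not the paper's implementation: the abstract does not specify the similarity measure or the clustering algorithm, so the token-set Jaccard similarity, the fixed list of representative queries, and every function name below are assumptions made for illustration only. The map step assigns each incoming query to its most similar representative, and the reduce step collects the queries assigned to each representative, whose stored plan could then be reused.

```python
import re
from collections import defaultdict


def tokenize(query):
    """Reduce a SQL string to its set of identifiers and keywords (literals are ignored)."""
    return frozenset(re.findall(r"[a-z_][a-z0-9_]*", query.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed measure, not the paper's)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def map_phase(chunk, rep_tokens):
    """Emit (index of most similar representative, query) for each query in a chunk."""
    for query in chunk:
        tokens = tokenize(query)
        best = max(range(len(rep_tokens)), key=lambda i: jaccard(tokens, rep_tokens[i]))
        yield best, query


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce runtime does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collect the queries assigned to each representative into one cluster."""
    return {key: sorted(values) for key, values in groups.items()}


def cluster_queries(queries, representatives, n_chunks=2):
    """Split the workload into chunks, map each chunk, then shuffle and reduce."""
    rep_tokens = [tokenize(r) for r in representatives]
    chunks = [queries[i::n_chunks] for i in range(n_chunks)]
    # On a real cluster each chunk would be mapped on a different worker.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk, rep_tokens)]
    return reduce_phase(shuffle(mapped))


# Hypothetical workload: two representatives with previously generated plans.
representatives = [
    "SELECT name FROM emp WHERE dept = 5",
    "SELECT total FROM orders WHERE id = 3",
]
queries = [
    "SELECT name FROM emp WHERE dept = 1",
    "SELECT name FROM emp WHERE dept = 2",
    "SELECT total FROM orders WHERE id = 7",
]
clusters = cluster_queries(queries, representatives)
```

Here `clusters` maps the index of each representative to the new queries that would reuse its plan. Because each chunk is mapped independently, the map step is the part that a framework such as Hadoop can distribute across workers.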
