For the storage and analysis of online or streaming data that is too large for conventional systems, most organizations are moving toward Apache Hadoop and HDFS. Applications such as log processors and search engines use Hadoop MapReduce for computation and HDFS for storage. Hadoop is widely used for the analysis, storage, and processing of very large data sets, but the system still leaves room for improvement. This work addresses two problems, data placement and redundant data processing, whose solutions help the Hadoop system improve processing speed and reduce the time needed to execute a task. Hadoop applications require streaming access to data files, yet the default placement policy of Hadoop does not consider any data characteristics when placing files. If a related set of files is stored on the same set of nodes, access latency can be reduced and efficiency increased. Hadoop uses the MapReduce framework to implement large-scale distributed computing over unpredictable data sets, and in this process potentially duplicate computations are performed; there is no mechanism to identify such duplicate computations, which increases processing time. The proposed solution co-locates related files by considering their content: a locality sensitive hashing algorithm, which is a clustering-based algorithm, attempts to co-locate related file streams on the same set of nodes without affecting the default scalability and fault tolerance properties of Hadoop. To avoid duplicate computation, a processing mechanism is developed that stores each executed task together with its result; before any task is executed, the stored tasks are consulted, and if a matching task is found its result is returned directly. Storing related files in the same cluster improves data locality, and avoiding repeated execution of tasks reduces processing time; together these two techniques speed up the execution of Hadoop.

Key terms — Hadoop, HDFS, MapReduce, Hashing Algorithm.

I. INTRODUCTION

Apache Hadoop is an open-source implementation of Google's MapReduce framework. It enables data-intensive, distributed, and parallel applications by dividing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. Map tasks process the partitioned data as key/value pairs and generate intermediate results; reduce tasks merge all intermediate values associated with the same key. Hadoop uses the Hadoop Distributed File System (HDFS), a distributed file system for storing large data files. Each file is divided into a number of blocks and replicated for fault tolerance. An HDFS cluster is based on a master/slave architecture: the NameNode acts as the master, managing and storing the file system namespace and mediating client access, while the slaves are a number of DataNodes. HDFS provides a file system namespace and allows user data to be stored in files. Each file is divided into blocks; the default block size is 64 MB, which is quite large. The default placement policy of Hadoop does not consider any data characteristics during placement. If related files are kept on the same set of DataNodes, access latency is reduced and efficiency increases. File similarity is calculated by comparing file contents, and to reduce the number of comparisons, locality sensitive hashing is used: points are hashed with several hash functions chosen so that the probability of collision is higher for similar points. The client controls the overall process and provides the sub-cluster id where a file is to be placed; otherwise the default placement strategy is used.
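To make the content-based grouping concrete, the following is a minimal sketch of how locality sensitive hashing could assign similar files to the same sub-cluster id. It is an illustration under assumptions, not the authors' implementation or a Hadoop API: the class name, the MinHash signature length, the band size, and the incremental bucket-to-cluster map are all illustrative choices.

import java.util.*;

/**
 * Illustrative MinHash-based LSH sketch: files whose contents are similar
 * tend to receive the same sub-cluster id, so they can be co-located on
 * the same set of DataNodes. All parameters here are assumed values.
 */
public class LshPlacement {
    private static final int NUM_HASHES = 64;  // MinHash signature length
    private static final int BAND_SIZE  = 8;   // rows per LSH band
    private final int[] seeds = new int[NUM_HASHES];
    // band bucket -> sub-cluster id seen so far (incremental clustering)
    private final Map<Long, Integer> bucketToCluster = new HashMap<>();
    private int nextClusterId = 0;

    public LshPlacement(long seed) {
        Random r = new Random(seed);
        for (int i = 0; i < NUM_HASHES; i++) seeds[i] = r.nextInt();
    }

    /** MinHash signature over the word tokens of the file content. */
    private int[] signature(String content) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String token : content.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            for (int i = 0; i < NUM_HASHES; i++) {
                int h = (token.hashCode() ^ seeds[i]) * 0x9E3779B9; // toy hash family
                sig[i] = Math.min(sig[i], h);
            }
        }
        return sig;
    }

    /**
     * Returns the sub-cluster id for a file. Files that agree on at least
     * one LSH band (i.e. are likely similar) reuse an existing id;
     * otherwise a new sub-cluster id is created.
     */
    public synchronized int subClusterId(String content) {
        int[] sig = signature(content);
        List<Long> buckets = new ArrayList<>();
        for (int band = 0; band < NUM_HASHES / BAND_SIZE; band++) {
            long h = band;                      // band index salts the bucket
            for (int row = 0; row < BAND_SIZE; row++)
                h = 31 * h + sig[band * BAND_SIZE + row];
            buckets.add(h);
        }
        Integer cluster = null;
        for (long b : buckets)
            if (bucketToCluster.containsKey(b)) { cluster = bucketToCluster.get(b); break; }
        if (cluster == null) cluster = nextClusterId++;
        for (long b : buckets) bucketToCluster.putIfAbsent(b, cluster);
        return cluster;
    }
}

In this sketch the client would call subClusterId on the incoming file content and pass the returned id to the placement layer; files that collide in no existing band bucket start a new sub-cluster rather than falling back to the default policy, which is one of several possible design choices.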
A data-aware cache is introduced to avoid the execution of repeated tasks; it requires each data object to be indexed by its content and implements a cache request and reply protocol. A minimal sketch of such a task-result cache is given at the end of this section.

The rest of this paper is organized as follows: Section II gives an overview of related work; Section III describes the design, including the mathematical model; Section IV presents results and discussion; and Section V concludes.
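The sketch below illustrates the idea of a data-aware task cache as described above: results of already-executed tasks are indexed by the content of their input split plus the identity of the operation, and looked up before a task is run again. The class and method names are assumptions for illustration; they are not part of Hadoop or of the system presented here.

import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Minimal sketch of a content-indexed task-result cache: if the same
 * (input data, operation) pair was executed before, the stored result is
 * returned directly instead of re-running the task.
 */
public class DataAwareCache {
    private final Map<String, String> resultStore = new ConcurrentHashMap<>();

    /** Key = operation id + hash of the input split content, so identical work is recognized. */
    private static String key(byte[] splitContent, String operationId) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder(operationId).append(':');
            for (byte b : md.digest(splitContent)) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Returns the cached result if this (data, operation) pair was executed
     * before; otherwise runs the task, stores its result, and returns it.
     */
    public String executeOrReuse(byte[] splitContent, String operationId,
                                 Function<byte[], String> task) {
        String k = key(splitContent, operationId);
        return resultStore.computeIfAbsent(k, ignored -> task.apply(splitContent));
    }
}

In a distributed setting the resultStore would sit behind the cache request and reply protocol mentioned above rather than in a local map; the local map is used here only to keep the sketch self-contained.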