Big Data and Big Data Scrutiny with Hadoop's MapReduce
Big Data refers to large-volume, complex, growing data sets drawn from multiple, heterogeneous, autonomous sources with evolving relationships among the data, alongside rapid growth in networking, data storage, and data collection capacity. The data is termed "Big Data" because of its characteristics of Volume, Variety, Velocity, and Veracity. Most Big Data is unstructured or semi-structured and heterogeneous in nature. The volume and heterogeneity of Big Data, together with the speed at which it is generated, make it difficult for present computing infrastructure to manage, and traditional data management, warehousing, and analysis systems therefore cannot analyze this data satisfactorily. To process Big Data, the HACE Theorem, which characterizes the features of Big Data, is considered. Apache Hadoop, together with the Hadoop Distributed File System (HDFS), is a software framework widely used for storing, managing, and analyzing Big Data; this is a challenging task because it involves large distributed file systems that must be fault tolerant, flexible, and scalable. Hadoop's MapReduce is widely used for the efficient processing of such large data sets on clusters. In this paper, various solutions are introduced through the MapReduce framework over HDFS. MapReduce is a data-reduction technique that makes use of file indexing with mapping, sorting, shuffling, and finally reducing, and MapReduce techniques implemented for Big Data analysis using HDFS are introduced.
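As a concrete illustration of the mapping, sorting/shuffling, and reducing phases described above, the following is a minimal word-count job sketched against the standard org.apache.hadoop.mapreduce API. It is not the paper's implementation; the class name WordCount and the command-line input/output paths are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch: the canonical word-count job, showing the
// map -> (framework sort/shuffle) -> reduce pipeline over HDFS data.
public class WordCount {

  // Map phase: emit a (word, 1) pair for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: by the time reduce() runs, the framework has already
  // sorted and grouped the intermediate pairs by key during the shuffle,
  // so each call receives one word and all of its counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Running the reducer as a combiner performs a local reduce on each
    // mapper's output, shrinking the data moved during the shuffle.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this would typically be packaged into a JAR and launched with a command such as "hadoop jar wordcount.jar WordCount /input /output" (paths hypothetical), where both directories reside on HDFS; the sort and shuffle between the map and reduce phases are handled by the framework itself, which is what makes the pattern scale across cluster nodes.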