Handling Small Size Files in Hadoop: Challenges, Opportunities, and Review

Recent advances in computing have led to the generation of voluminous data that traditional tools, processes, and systems cannot handle effectively. New techniques and frameworks have emerged to handle this big data. Hadoop is a prominent framework for managing huge amounts of data. It provides efficient means for the storage, retrieval, processing, and analytics of big data. Although Hadoop works very well with large files, its performance tends to degrade when it must process hundreds or thousands of small files. This paper puts forward the challenges and opportunities that may arise while handling a large number of small files. It also presents a comprehensive review of the techniques available for efficiently handling small files in Hadoop, comparing them on performance parameters such as access time, read/write complexity, scalability, and processing speed.
