As annual network traffic continues to grow rapidly, systems that analyze this traffic must scale efficiently to keep up. The Hadoop ecosystem can address this need through its horizontal scalability and its data locality optimization feature. Data locality improves parallel analysis by placing each computing task on the node that already holds the block of data to be analyzed. However, input formats that are not splittable within the Hadoop Distributed File System cannot take advantage of this feature. The PCAP format used for capturing network data is one such format. To address this issue, this paper proposes a minimal preprocessing step that is applied before PCAP files are fed into Hadoop and analyzed using the hcap framework, which is currently the fastest framework for analyzing PCAP data in Hadoop. The preprocessing step adapts the PCAP files into properly split blocks so that Hadoop's data locality optimization can be used. Results show a significant improvement in query response time, with performance gains of 92%, 89%, 91%, and 87% for scan, aggregate, join, and aggregate-join queries, respectively, compared to the original hcap framework.
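The abstract does not spell out how the preprocessing step works, so the following is only a minimal sketch of one way such a step could look, not the paper's implementation. It assumes the classic libpcap file layout (a 24-byte global header followed by records that each carry a 16-byte record header) and cuts the capture at packet boundaries into standalone PCAP chunks, each small enough to fit in one HDFS block. The function name `split_pcap` and the `block_size` parameter are hypothetical.

```python
import struct

GLOBAL_HEADER_LEN = 24   # classic pcap global header size
RECORD_HEADER_LEN = 16   # ts_sec, ts_frac, incl_len, orig_len (4 bytes each)


def split_pcap(data: bytes, block_size: int) -> list:
    """Split classic-pcap bytes into standalone pcap chunks (hypothetical helper).

    Each chunk starts with a copy of the global header and holds only whole
    packet records. A chunk stays within block_size bytes unless a single
    record is larger than that on its own.
    """
    if len(data) < GLOBAL_HEADER_LEN:
        raise ValueError("input is shorter than a pcap global header")
    global_header = data[:GLOBAL_HEADER_LEN]

    # The magic number reveals the byte order used by the writer.
    magic = global_header[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    chunks = []
    current = bytearray(global_header)
    offset = GLOBAL_HEADER_LEN
    while offset + RECORD_HEADER_LEN <= len(data):
        # incl_len is the number of captured bytes stored after the header.
        incl_len = struct.unpack_from(endian + "I", data, offset + 8)[0]
        end = offset + RECORD_HEADER_LEN + incl_len
        if end > len(data):
            break  # truncated final record; this sketch simply drops it
        record = data[offset:end]
        # Start a new chunk if this record would push the current one over.
        if len(current) > GLOBAL_HEADER_LEN and len(current) + len(record) > block_size:
            chunks.append(bytes(current))
            current = bytearray(global_header)
        current += record
        offset = end

    if len(current) > GLOBAL_HEADER_LEN:
        chunks.append(bytes(current))
    return chunks
```

In practice `block_size` would be chosen to match the cluster's HDFS block size (128 MiB is a common default), and each chunk would be written out as its own file so that no packet record straddles a block boundary.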