Research of Hadoop-based data flow management system

Abstract Data flows are central to network management, network security, and network experiments. With the rapid growth of computer network data, however, traditional data flow management systems can no longer meet current needs. Combining the features of data flow management systems with the advantages of the Hadoop cloud computing platform, this paper designs and implements a distributed data flow management system based on Hadoop. The system uses MapReduce to process user requests, the Hadoop Distributed File System (HDFS) to manage data flow files, and the Hadoop database (HBase) to manage data flow information. Tests show that the system outperforms a traditional data flow management system in efficiency, scalability, and reliability, and that it is better suited to large-scale data flow management demands.
