Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique

The increased use of cyber-enabled systems and the Internet of Things (IoT) has led to a massive amount of data with different structures. Most big data solutions are built on top of the Hadoop ecosystem or use its distributed file system (HDFS). However, studies have shown that such systems are inefficient when dealing with today's data. Some research has overcome these problems for specific types of graph data, but today's data span more than one type. Such efficiency issues may lead to large-scale problems, including larger space requirements in data centers and wasted resources (such as power), which in turn lead to environmental problems (such as higher carbon emissions) [1]. We propose a data-aware module for the Hadoop ecosystem. We also propose a distributed encoding technique for genetic algorithms to support efficient data processing. Our framework allows Hadoop to manage the distribution and placement of data based on cluster analysis of the data itself. The framework handles a broad range of data types and optimizes query time and resource usage. We performed experiments on multiple datasets generated via LUBM (Lehigh University Benchmark) and report the results along with a performance analysis.
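
As a rough illustration of placement driven by cluster analysis, the following Python sketch co-locates records that share a cluster label on the same storage node. It is a minimal sketch under stated assumptions, not the module described above: the function name `place_by_cluster`, the greedy largest-first balancing policy, and the toy LUBM-style record names are all hypothetical, and the cluster labels are taken as given rather than produced by a genetic algorithm.

```python
from collections import defaultdict


def place_by_cluster(cluster_of, num_nodes):
    """Map each record to a storage node index so that records sharing a
    cluster label land on the same node (hypothetical policy)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")

    # Group records by their cluster label.
    members = defaultdict(list)
    for record, label in cluster_of.items():
        members[label].append(record)

    load = [0] * num_nodes  # number of records placed on each node so far
    placement = {}
    # Visit clusters from largest to smallest (ties broken by label) and
    # send each whole cluster to the currently least-loaded node.
    for label in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = min(range(num_nodes), key=lambda n: load[n])
        for record in members[label]:
            placement[record] = target
        load[target] += len(members[label])
    return placement


# Toy input: record identifiers with cluster labels assumed to come from
# an earlier clustering step (the labels here are made up).
cluster_of = {
    "univ0/dept0": "A", "univ0/prof1": "A", "univ0/course2": "A",
    "univ1/dept0": "B", "univ1/prof3": "B",
    "univ2/dept4": "C",
}
placement = place_by_cluster(cluster_of, num_nodes=2)
```

Keeping a cluster on one node means a query that touches related records can often be answered without cross-node traffic; the greedy balancing step is only one simple way to keep node loads comparable.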

[1]  Chase Qishi Wu,et al.  On a multi-objective evolutionary algorithm for optimizing end-to-end performance of scientific workflows in distributed environments , 2012, SpringSim.

[2]  Clara Pizzuti,et al.  GA-Net: A Genetic Algorithm for Community Detection in Social Networks , 2008, PPSN.

[3]  Fiona Skerman,et al.  Modularity of networks , 2015.

[4]  Abraham Silberschatz,et al.  Efficient processing of data warehousing queries in a split execution environment , 2011, SIGMOD '11.

[5]  Pablo Basanta-Val,et al.  T-Hoarder: A framework to process Twitter data streams , 2017, J. Netw. Comput. Appl.

[6]  Antonio J. Nebro,et al.  jMetal: A Java framework for multi-objective optimization , 2011, Adv. Eng. Softw.

[7]  Georg Lausen,et al.  Cascading Map-Side Joins over HBase for Scalable Join Processing , 2012, SSWS+HPCSW@ISWC.

[8]  V. K. Jayaraman,et al.  Clustering of Complex Networks and Community Detection Using Group Search Optimization , 2013, ArXiv.

[9]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow.

[10]  Dipankar Dasgupta,et al.  Distributed evolutionary approach to data clustering and modeling , 2014, 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).

[11]  Nathan Marz,et al.  Big Data: Principles and best practices of scalable realtime data systems , 2015.

[12]  M E J Newman,et al.  Modularity and community structure in networks. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Clara Pizzuti,et al.  A Multi-objective Genetic Algorithm for Community Detection in Networks , 2009, 2009 21st IEEE International Conference on Tools with Artificial Intelligence.

[14]  Claudio Gutierrez,et al.  Survey of graph database models , 2008, CSUR.

[15]  Daniel J. Abadi,et al.  Scalable SPARQL querying of large RDF graphs , 2011, Proc. VLDB Endow.

[16]  Marcos Dias de Assunção,et al.  Apache Spark , 2019, Encyclopedia of Big Data Technologies.

[17]  M E J Newman,et al.  Community structure in social and biological networks , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[18]  Michalis Vazirgiannis,et al.  Clustering and Community Detection in Directed Networks: A Survey , 2013, ArXiv.

[19]  Marisol García-Valls,et al.  A Distributed Real-Time Java-Centric Architecture for Industrial Systems , 2014, IEEE Transactions on Industrial Informatics.

[20]  Jürgen Umbrich,et al.  YARS2: A Federated Repository for Querying Graph Structured Data from the Web , 2007, ISWC/ASWC.

[21]  Renzo Angles,et al.  A Comparison of Current Graph Database Models , 2012, 2012 IEEE 28th International Conference on Data Engineering Workshops.

[22]  P. Hansen,et al.  Column generation algorithms for exact modularity maximization in networks. , 2010, Physical review. E, Statistical, nonlinear, and soft matter physics.

[23]  Georg Lausen,et al.  PigSPARQL: A SPARQL Query Processing Baseline for Big Data , 2013, International Semantic Web Conference.

[24]  Georg Lausen,et al.  Sempala: Interactive SPARQL Query Processing on Hadoop , 2014, SEMWEB.

[25]  Mustafa H. Hajeer.  Distributed Evolutionary Algorithm for Clustering Multi-Characteristic Social Networks , 2014.

[26]  Dipankar Dasgupta,et al.  Political Communities in Russian Portion of LiveJournal , 2014, 2014 International Conference on Computational Science and Computational Intelligence.

[27]  Qingfu Zhang,et al.  Community detection in networks by using multiobjective evolutionary algorithm with decomposition , 2012.

[28]  M E J Newman,et al.  Fast algorithm for detecting community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[29]  John Gantz,et al.  The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East , 2012.

[30]  Ronghua Shang,et al.  Community detection based on modularity and an improved genetic algorithm , 2013.

[31]  Aman Kumar Sharma,et al.  Proposed algorithms for effective real time stream analysis in big data , 2015, 2015 Third International Conference on Image Information Processing (ICIIP).

[32]  Andy J. Wellings,et al.  Architecting Time-Critical Big-Data Systems , 2016, IEEE Transactions on Big Data.

[33]  Georg Lausen,et al.  Map-Side Merge Joins for Scalable SPARQL BGP Processing , 2013, 2013 IEEE 5th International Conference on Cloud Computing Technology and Science.

[34]  W. Zachary,et al.  An Information Flow Model for Conflict and Fission in Small Groups , 1977, Journal of Anthropological Research.

[35]  Zhihan Lv,et al.  Next-Generation Big Data Analytics: State of the Art, Challenges, and Future Research Topics , 2017, IEEE Transactions on Industrial Informatics.

[36]  Zuren Feng,et al.  Community detection using Ant Colony Optimization , 2013, IEEE Congress on Evolutionary Computation.

[37]  Amiya Nayak,et al.  Handbook of Applied Algorithms: Solving Scientific, Engineering, and Practical Problems , 2008.

[38]  Qingfu Zhang,et al.  Identification of multi-resolution network structures with multi-objective immune algorithm , 2013, Appl. Soft Comput.

[39]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[40]  Richard E. Schantz,et al.  Clause-iteration with MapReduce to scalably query datagraphs in the SHARD graph-store , 2011, DIDC '11.

[41]  M. Newman,et al.  Finding community structure in very large networks. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[42]  Georg Lausen,et al.  PigSPARQL: mapping SPARQL to Pig Latin , 2011, SWIM '11.

[43]  Santo Fortunato,et al.  Community detection in graphs , 2009, ArXiv.

[44]  Andy J. Wellings,et al.  Improving the predictability of distributed stream processors , 2015, Future Gener. Comput. Syst.

[45]  Yvonne Freeh.  Graph Theoretic Techniques For Web Content Mining , 2016.

[46]  Roberto Baldoni,et al.  Adaptive online scheduling in storm , 2013, DEBS.

[47]  Orri Erling.  Towards Web Scale RDF , 2008.

[48]  Claudio Gutiérrez,et al.  Bipartite Graphs as Intermediate Model for RDF , 2004, SEMWEB.

[49]  Haluk Bingol,et al.  Community Detection in Complex Networks Using Genetic Algorithms , 2006, arXiv:0711.0491.

[50]  Barbara A. Eckman,et al.  Graph data management for molecular and cell biology , 2006, IBM J. Res. Dev.

[51]  S. Sudarshan,et al.  Data models , 1996, CSUR.

[52]  Jianwu Li,et al.  Community detection in complex networks using extended compact genetic algorithm , 2012, Soft Computing.

[53]  Jianyong Wang,et al.  Parallel community detection on large networks with propinquity dynamics , 2009, KDD.

[54]  Ulrik Brandes,et al.  On Modularity Clustering , 2008, IEEE Transactions on Knowledge and Data Engineering.