Mining Massive Data Using Hadoop MapReduce and Bio-inspired Algorithms: A Systematic Review

Data mining has been applied in many domains and aims to extract knowledge through data analysis. In recent decades, many databases have come to exhibit large volume, high growth velocity, and great variety. This phenomenon is known as Big Data, and it poses new challenges for classical technologies such as relational database management systems, which have not offered satisfactory performance or scalability for Big Data applications. In contrast to these technologies, Hadoop MapReduce is a framework that, in addition to parallel processing, provides fault tolerance and easy scalability on top of a distributed storage system suited to Big Data scenarios. Bio-inspired algorithms are among the techniques that have been used in the Big Data context. These algorithms are good options for solving complex multidimensional, multi-objective, and large-scale problems. Combining Hadoop MapReduce-based systems with bio-inspired algorithms has proven advantageous in Big Data applications. This article presents a systematic review of work in this context, analyzing criteria such as the data mining tasks addressed, the bio-inspired algorithms used, the availability of the datasets used, and which Big Data characteristics the works handle. As a result, the article discusses the criteria analyzed, identifies some parallelization models, and suggests a direction for future work.
