Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks

The term ‘Big Data’ has spread rapidly in the fields of Data Mining and Business Intelligence. It refers to those problems that cannot be addressed effectively or efficiently with the standard computing resources currently available. We must emphasize that Big Data implies not only large volumes of data but also the need for scalability, i.e., ensuring a response within an acceptable elapsed time. When scalability is considered, traditional parallel-type solutions are usually contemplated, such as the Message Passing Interface or high-performance and distributed Database Management Systems. Nowadays a new paradigm has gained popularity over these approaches because of the benefits it offers. This model is Cloud Computing, and among its main features we must stress its elasticity in the use of computing resources and storage space, its reduced management effort, and its flexible costs. In this article, we provide an overview of Big Data and of how the problem can be addressed from the perspective of Cloud Computing and its programming frameworks. In particular, we focus on systems for large-scale analytics based on the MapReduce scheme and on Hadoop, its open-source implementation. We identify several libraries and software projects that have been developed to help practitioners work with this new programming model. We also analyze the advantages and disadvantages of MapReduce in contrast to the classical solutions in this field. Finally, we present a number of programming frameworks that have been proposed as alternatives to MapReduce, developed to overcome the shortcomings of this model in certain scenarios and platforms. WIREs Data Mining Knowl Discov 2014, 4:380–409. doi: 10.1002/widm.1134
