A Survey on Geographically Distributed Big-Data Processing Using MapReduce

Hadoop and Spark are widely used distributed processing frameworks for large-scale data processing in an efficient and fault-tolerant manner on private or public clouds. These big-data processing systems are extensively used by many companies, e.g., Google, Facebook, and Amazon, to solve a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems share a major drawback: they are limited to locally distributed computations, which prevents them from implementing geographically distributed data processing. The increasing amount of geographically distributed massive data is pushing industry and academia to rethink current big-data processing systems. Novel frameworks, which will go beyond the state-of-the-art architectures and technologies of current systems, are expected to process geographically distributed data at its location without moving entire raw datasets to a single location. In this paper, we investigate and discuss the challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study geo-distributed frameworks, models, and algorithms for batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing, along with their overhead issues.
