Big Data for Smart Infrastructure Design: Opportunities and Challenges

Big data is at the forefront of many ICT-based developments in all spheres of life, be it business, education, or entertainment. It is generated from many diverse sources, including social media, the Internet of Things (IoT), manufacturing, and operations. Big data technologies allow us to make informed decisions from structured or unstructured data. Managing and analyzing the heterogeneous data generated by these sources brings numerous challenges and a diversity of solutions. The aim of this chapter is to discuss the opportunities, issues, and challenges of big data, with the main focus on Hadoop platforms. We provide a detailed survey of the opportunities, challenges, and issues of Hadoop-based big data developments in terms of data locality, load balancing, heterogeneity, scheduling, in-memory computation, multiple query optimization, and I/O. A taxonomy of these challenges and opportunities is also presented.
