Big Data Computing and Clouds: Challenges, Solutions, and Future Directions

This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications. It is organised around four key areas of analytics and Big Data: (i) data management and supporting architectures; (ii) model development and scoring; (iii) visualisation and user interaction; and (iv) business models. Through a detailed survey, we identify possible gaps in technology and provide recommendations for the research community on future directions for Cloud-supported Big Data computing and analytics solutions.
