Big Data computing and clouds: Trends and future directions

This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications. It revolves around four important areas of analytics and Big Data, namely (i) data management and supporting architectures; (ii) model development and scoring; (iii) visualisation and user interaction; and (iv) business models. Through a detailed survey, we identify possible gaps in technology and provide recommendations for the research community on future directions for Cloud-supported Big Data computing and analytics solutions.

Survey of solutions for carrying out analytics and Big Data on Clouds. Identification of gaps in technology for Cloud-based analytics. Recommendations of research directions for Cloud-based analytics and Big Data.
