Efficient Data Analytics Over Cloud

Abstract: Many industries, such as telecom, health care, retail, pharmaceuticals, and financial services, generate large amounts of data. This data needs to be processed quickly to yield critical business insights. Data warehouses and the solutions built around them cannot provide reasonable response times as data volumes expand. Today one can either run analytics over large volumes of data once every few days or process transactions over small amounts of data in seconds. The new requirements call for real-time or near-real-time responses over huge amounts of data. In this chapter we cover several important aspects of analyzing big data. We start with the challenges one needs to overcome when moving data and data management applications to the cloud. For big data we describe two kinds of systems: (1) NoSQL systems for interactive data-serving environments; and (2) systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop. NoSQL systems are designed around a simpler key-value data model with built-in sharding; hence, they work seamlessly in a distributed cloud-based environment. In contrast, Hadoop-based systems can be used to run long-running decision-support and analytical queries that consume, and possibly produce, bulk data. We illustrate various middleware and applications that can use these technologies to quickly process massive amounts of data.
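The contrast between the two kinds of systems can be illustrated with a minimal Python sketch. It is not taken from the chapter and does not reproduce the API of any real NoSQL store or of Hadoop. The names `ShardedKVStore`, `map_reduce`, and the sample `sales` records are hypothetical and exist only for illustration. The first part shows a point lookup routed to one shard by hashing the key; the second shows a bulk aggregation expressed as a map step and a reduce step.

```python
import hashlib
from collections import defaultdict


class ShardedKVStore:
    """Toy key-value store that hash-partitions keys across in-memory shards.

    Hypothetical illustration only; real systems add replication,
    rebalancing, and persistence.
    """

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash makes the same key always map to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        # A point lookup touches exactly one shard.
        return self.shards[self._shard_for(key)].get(key, default)


def map_reduce(records, mapper, reducer):
    """Single-process sketch of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: values are grouped by key.
            groups[key].append(value)
    # Reduce phase: each key's values are combined into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Interactive serving: write and read a single record by key.
store = ShardedKVStore(num_shards=4)
store.put("user:42", {"name": "Ada", "plan": "prepaid"})
profile = store.get("user:42")

# Bulk analytics: total sales amount per region over all records.
sales = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 7},
]
totals = map_reduce(
    sales,
    mapper=lambda r: [(r["region"], r["amount"])],
    reducer=lambda region, amounts: sum(amounts),
)
```

In this sketch the key-value lookup does a constant amount of work per request, whereas the map/reduce job scans every input record, which mirrors the split between interactive serving and long-running analytical queries described above.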
