Optimizing Intelligent Reduction Techniques for Big Data

Working with large volumes of data collected by many applications and held in multiple storage locations is both challenging and rewarding. Extracting valuable information from data requires combining qualitative and quantitative analysis techniques. One of the main promises of analytics is data reduction, whose primary function is to support decision-making. The motivation for this chapter comes from the new generation of applications (social media, smart cities, cyber-infrastructures, environmental monitoring and control, healthcare, etc.), which produce big data and many new mechanisms for data creation rather than new mechanisms for data storage. The goal of this chapter is to analyze existing techniques for data reduction at scale, in order to facilitate the optimization and understanding of Big Data processing. The chapter covers the following subjects: data manipulation, analytics, and Big Data reduction techniques, considering descriptive, predictive, and prescriptive analytics. The CyberWater case study is presented with reference to the optimization process and to the monitoring, analysis, and control of natural resources, especially water resources, in order to preserve water quality.
