Optimal "Big Data" Aggregation Systems - From Theory to Practical Application

Culhane, William John IV PhD, Purdue University, May 2015. Optimal “Big Data” Aggregation Systems – From Theory to Practical Application. Major Professor: Patrick Eugster.

The integration of computers into many facets of our lives has made the collection and storage of staggering amounts of data feasible. However, the data on its own is less useful than the analysis and manipulation that extract manageable descriptive information from it. New tools are required to extract this information from ever-growing repositories of data. Some of these analyses take the form of a two-phase problem that is easily distributed to take advantage of available computing power. The first phase computes a descriptive partial result from a subset of the original data, and the second phase aggregates all the partial results into a combined output. We formalize this compute-aggregate model for rigorous performance analysis, with the goal of minimizing the latency of the aggregation phase while requiring minimal intrusive analysis or modification. Based on our model, we identify an attribute of the aggregation overlay that strongly affects aggregation latency, and we characterize its dependence on an easily determined trait of the aggregation. We rigorously prove this dependence and derive optimal overlays for aggregation. We use the proven optima to create simple heuristics and build a system, NOAH, that takes advantage of these findings and can be used by big data analysis systems. We also study an individual problem, top-k matching, to explore the effects of optimizing the computation phase separately from aggregation, and we create a complete distributed system to fulfill an economically relevant task.
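The two-phase structure described above can be illustrated with a minimal single-process sketch. It is not NOAH or the thesis's implementation: the function names (`compute_phase`, `aggregate_phase`, `local_top_k`, `merge_top_k`) and the `fan_in` parameter are hypothetical, chosen only for illustration, and `fan_in` is not necessarily the overlay attribute the thesis identifies. The sketch uses a simple top-k selection as the example problem, with each partition reduced to a partial result that is then merged level by level in a tree.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")  # raw data item
P = TypeVar("P")  # partial result


def compute_phase(partitions: Sequence[Sequence[D]],
                  compute: Callable[[Sequence[D]], P]) -> List[P]:
    """Phase 1: derive one partial result from each subset of the data."""
    return [compute(part) for part in partitions]


def aggregate_phase(partials: Sequence[P],
                    combine: Callable[[Sequence[P]], P],
                    fan_in: int = 2) -> P:
    """Phase 2: merge partial results level by level in a tree.

    `fan_in` (an illustrative assumption) is how many partial results
    each simulated aggregation node combines at once.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not partials:
        raise ValueError("need at least one partial result")
    level = list(partials)
    while len(level) > 1:
        # Each slice of `fan_in` results stands in for one tree node.
        level = [combine(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


def local_top_k(k: int) -> Callable[[Sequence[int]], List[int]]:
    """Compute function: the k largest values of one partition."""
    return lambda part: heapq.nlargest(k, part)


def merge_top_k(k: int) -> Callable[[Sequence[List[int]]], List[int]]:
    """Aggregation function: the k largest values across partial results."""
    return lambda groups: heapq.nlargest(k, (x for g in groups for x in g))


if __name__ == "__main__":
    partitions = [[5, 1, 9], [7, 3], [8, 2, 6], [4]]
    partials = compute_phase(partitions, local_top_k(3))
    print(aggregate_phase(partials, merge_top_k(3), fan_in=2))
```

Because merging top-k lists gives the same answer regardless of how the partial results are grouped, changing `fan_in` in this sketch alters only the shape of the tree, not the result. That shape is the kind of overlay choice whose effect on latency the abstract refers to.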
