Dependency-Driven Analytics: A Compass for Uncharted Data Oceans

In this paper, we predict the rise of Dependency-Driven Analytics (DDA), a new class of data analytics designed to cope with growing volumes of unstructured data. DDA drastically reduces the cognitive burden of data analysis by systematically leveraging a compact dependency graph derived from the raw data. The computational cost of the analysis is also reduced substantially, as the graph acts as an index for commonly accessed data items. We built a system supporting DDA using off-the-shelf Big Data and graph DB technologies, and deployed it in production at Microsoft to support the analysis of the exhaust of our Big Data infrastructure, which produces petabytes of system logs daily. The dependency graph in this setting captures lineage information among jobs and files and is used to guide the analysis of telemetry data. We qualitatively discuss the improvement over the brute-force analytics our users used to perform by considering a series of practical applications, including job auditing and compliance, automated SLO extraction of recurring tasks, and global job ranking. We conclude by discussing the shortcomings of our current implementation and by presenting some of the open research challenges for Dependency-Driven Analytics that we plan to tackle next.
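
To make the idea of a job/file dependency graph concrete, the following is a minimal illustrative sketch in Python, not the system described in the paper. It assumes lineage can be reduced to "job reads file" and "job writes file" events; the class name `DependencyGraph`, its methods, and the `job:`/`file:` identifiers are all hypothetical. Traversing the graph answers a lineage question (for example, which jobs and files are affected by a given input) without rescanning raw logs, which is the sense in which the graph acts as an index.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Hypothetical in-memory lineage graph over jobs and files."""

    def __init__(self):
        self.downstream = defaultdict(set)  # node -> nodes that depend on it
        self.upstream = defaultdict(set)    # node -> nodes it depends on

    def _add_edge(self, src, dst):
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    def add_read(self, job, file):
        # A job that reads a file depends on that file: file -> job.
        self._add_edge(file, job)

    def add_write(self, job, file):
        # A file written by a job depends on that job: job -> file.
        self._add_edge(job, file)

    def reachable(self, start, direction="downstream"):
        """Return every node reachable from `start` (breadth-first)."""
        edges = self.downstream if direction == "downstream" else self.upstream
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)  # exclude the start node even if a cycle reaches it
        return seen


# Hypothetical lineage events, as they might be extracted from system logs.
g = DependencyGraph()
g.add_write("job:ingest", "file:raw.log")
g.add_read("job:clean", "file:raw.log")
g.add_write("job:clean", "file:clean.tsv")
g.add_read("job:report", "file:clean.tsv")

impacted = g.reachable("file:raw.log")                      # downstream impact
provenance = g.reachable("job:report", direction="upstream")  # upstream lineage
```

In this toy graph, `impacted` contains the two jobs and one file derived from `file:raw.log`, and `provenance` contains everything `job:report` transitively depends on. A production deployment would presumably keep such a graph in a graph database rather than in memory.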
