LAMP: data provenance for graph based machine learning algorithms through derivative computation

Data provenance tracking determines the set of inputs related to a given output, which enables quality control and problem diagnosis in data engineering. Most existing techniques work by tracking program dependencies. They cannot quantitatively assess the importance of related inputs. This assessment is critical for machine learning algorithms, in which an output tends to depend on a huge set of inputs, only some of which are important. In this paper, we propose LAMP, a provenance computation system for machine learning algorithms. Inspired by automatic differentiation (AD), LAMP quantifies the importance of an input for an output by computing the partial derivative of the output with respect to that input. LAMP separates the original data processing and the more expensive derivative computation into different processes to achieve cost-effectiveness. In addition, it allows quantifying importance for inputs related to discrete behavior, such as control flow selection. An evaluation on a set of real-world programs and data sets shows that LAMP produces more precise and succinct provenance than program-dependence-based techniques, with much less overhead. Our case studies demonstrate the potential of LAMP for problem diagnosis in data engineering.
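To make the derivative-as-importance idea concrete, the following is a minimal sketch, not LAMP's implementation. It assumes a toy weighted PageRank on a three-node graph with no dangling nodes and uses hand-written forward-mode dual numbers. The names `Dual`, `weighted_pagerank`, and `edge_importance` are hypothetical and introduced only for illustration. Unlike LAMP, the sketch computes values and derivatives in the same process.

```python
from dataclasses import dataclass


def _lift(x):
    """Wrap a plain number as a constant Dual (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))


@dataclass(frozen=True)
class Dual:
    """Forward-mode AD value: val is the result, dot its derivative w.r.t. the seeded input."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        o = _lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, other):
        o = _lift(other)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

    def __truediv__(self, other):
        o = _lift(other)
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / (o.val * o.val))


def weighted_pagerank(edges, n, damping=0.85, iters=100):
    """Power iteration over {(src, dst): Dual weight}; assumes every node has an out-edge."""
    out_sum = [Dual(0.0)] * n
    for (src, _), w in edges.items():
        out_sum[src] = out_sum[src] + w
    rank = [Dual(1.0 / n)] * n
    for _ in range(iters):
        nxt = [Dual((1.0 - damping) / n)] * n
        for (src, dst), w in edges.items():
            nxt[dst] = nxt[dst] + damping * rank[src] * w / out_sum[src]
        rank = nxt
    return rank


def edge_importance(weights, n, target):
    """Partial derivative of rank[target] w.r.t. each edge weight, one seeded pass per edge."""
    result = {}
    for seed in weights:
        edges = {e: Dual(w, 1.0 if e == seed else 0.0) for e, w in weights.items()}
        result[seed] = weighted_pagerank(edges, n)[target].dot
    return result


# Illustrative input graph (made up for this sketch): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 0): 1.0}
importance = edge_importance(weights, n=3, target=2)
for edge, d in sorted(importance.items(), key=lambda kv: -abs(kv[1])):
    print(edge, d)
```

A dependence-based tracker would report all four edges as related to the rank of node 2. The derivatives separate them: edges whose weight is normalized away (the only out-edge of a node) get a derivative of zero, while the two out-edges of node 0 get nonzero derivatives of opposite sign.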
