A Link is not Enough – Reproducibility of Data

Although many works in the database community use open data in their experimental evaluation, repeating the empirical results of previous works remains a challenge. This holds true even if the source code or binaries of the tested algorithms are available. In this paper, we argue that providing access to the raw, original datasets is not enough. Real-world datasets are rarely processed without modification. Instead, the data is adapted to the needs of the experimental evaluation during data preparation. We show that the details of the data preparation process matter and that subtle differences in data conversion can have a large impact on measured runtimes. We introduce a data reproducibility model, identify three levels of data reproducibility, report on our own experience, and illustrate our best practices with examples.
