Paper information - The (black) art of runtime evaluation: Are we comparing algorithms or implementations?

Any paper proposing a new algorithm should come with an evaluation of efficiency and scalability (particularly when we are designing methods for “big data”). However, there are several (more or less serious) pitfalls in such evaluations. We would like to point the attention of the community to these pitfalls. We substantiate our points with extensive experiments, using clustering and outlier detection methods with and without index acceleration. We discuss what we can learn from evaluations, whether experiments are properly designed, and what kind of conclusions we should avoid. We close with some general recommendations but maintain that the design of fair and conclusive experiments will always remain a challenge for researchers and an integral part of the scientific endeavor.
