Orderings of Data - More Than a Tripping Hazard: Visionary

As data processing techniques grow more sophisticated, we researchers easily get lost in the details and subtleties of the algorithms we develop and forget to look at the very first step of every algorithm: the input of the data. Since plenty of library functions handle this task, it seems we no longer have to think about this part of the pipeline. But maybe we should. All data is stored and loaded into a program in some order. In this vision paper we study how ignoring this order can (1) lead to performance issues and (2) make research results unreproducible. We furthermore examine desirable properties of a data ordering and explain why current approaches are often not suited to tackle these two problems.
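
To make the reproducibility claim concrete, the following is a minimal sketch of an order-dependent algorithm. It is an illustrative assumption, not a method taken from the paper: the function name `leader_clustering`, the `radius` parameter, and the toy data are all hypothetical. The sketch uses a simple single-pass greedy clustering in which each point joins the first cluster whose first member lies within `radius`, and shows that the same three points can yield different clusterings depending only on the order in which they are loaded.

```python
from typing import List, Sequence


def leader_clustering(points: Sequence[float], radius: float) -> List[List[float]]:
    """Single-pass greedy clustering (hypothetical helper for illustration).

    Each point joins the first existing cluster whose leader (its first
    member) lies within `radius`; otherwise it starts a new cluster.
    """
    clusters: List[List[float]] = []
    for p in points:
        for cluster in clusters:
            if abs(p - cluster[0]) <= radius:
                cluster.append(p)
                break
        else:
            # No existing leader is close enough: p becomes a new leader.
            clusters.append([p])
    return clusters


# Same three points, two different input orders (toy data, assumed).
data = [0.0, 1.0, 2.0]
reordered = [1.0, 0.0, 2.0]

print(leader_clustering(data, radius=1.0))       # two clusters
print(leader_clustering(reordered, radius=1.0))  # one cluster
```

With the first order, 0.0 becomes a leader and 2.0 is too far from it, so two clusters result. With the second order, 1.0 becomes the leader and both other points fall within its radius, so only one cluster results. An experiment that reports the number of clusters without fixing the input order would therefore not be reproducible.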
