Improving Large-Scale Fingerprint-Based Queries in Distributed Infrastructure

Fingerprints are often used in sketching mechanisms, which map elements into a concise, representative synopsis that occupies little space. Large-scale fingerprint-based queries are an important tool in big data analytics, supporting set membership queries, rank-based queries, and correlation queries, among others. In this paper, we propose an efficient approach to improving the performance of large-scale fingerprint-based queries in a distributed infrastructure. At the initial stage of a query, we transform the fingerprint sketch into a space-constrained, global rank-based sketch at the query site by collecting minimal information from the local sites. Time-consuming operations, such as local fingerprint construction and searching, are pushed down to the local sites. The proposed approach constructs large-scale, scalable fingerprints efficiently and dynamically; it also supervises continuous queries by using the global sketch and runs an appropriate number of jobs over the distributed computing environment. We implement our approach in Spark and evaluate its performance on real-world datasets. Compared with native SparkSQL, our approach improves query response time by two orders of magnitude.
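
To make the division of work between local sites and the query site concrete, the following is a minimal single-process sketch of a fingerprint-based set membership query. It is an illustration under stated assumptions, not the method implemented in the paper: the 16-bit fingerprint width, the use of SHA-256, the class names LocalSite and QuerySite, and the (min, max, count) summary that stands in for the global rank-based sketch are all hypothetical choices made for this example.

```python
import hashlib

FINGERPRINT_BITS = 16  # assumed width; not taken from the paper


def fingerprint(item: str, bits: int = FINGERPRINT_BITS) -> int:
    """Map an item to a short integer fingerprint derived from SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


class LocalSite:
    """Hypothetical local site: builds and searches its own fingerprints."""

    def __init__(self, items):
        # Fingerprint construction is done locally ("pushed down").
        self.fingerprints = {fingerprint(x) for x in items}

    def summary(self):
        """Small summary shipped to the query site: (min, max, count)."""
        if not self.fingerprints:
            return None
        return (min(self.fingerprints), max(self.fingerprints),
                len(self.fingerprints))

    def may_contain(self, item: str) -> bool:
        """Local search; may report false positives, never false negatives."""
        return fingerprint(item) in self.fingerprints


class QuerySite:
    """Hypothetical query site: keeps only the per-site summaries."""

    def __init__(self, sites):
        self.sites = sites
        self.summaries = [site.summary() for site in sites]

    def may_contain(self, item: str) -> bool:
        fp = fingerprint(item)
        for site, summ in zip(self.sites, self.summaries):
            if summ is None:
                continue
            low, high, _count = summ
            # Skip sites whose fingerprint range cannot hold this item.
            if low <= fp <= high and site.may_contain(item):
                return True
        return False


sites = [LocalSite(["alice", "bob"]), LocalSite(["carol"]), LocalSite([])]
query_site = QuerySite(sites)
print(query_site.may_contain("carol"))
```

Because distinct items can share a fingerprint, a positive answer means "possibly present", while a negative answer is definite. The range check at the query site is only a simple stand-in for using a global sketch to decide which local sites need to be contacted.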
