Eventually consistent cardinality estimation with applications in biodata mining

Large-set cardinality estimators and other streaming-oriented operations are a cornerstone of big data processing. Combined with in-memory storage systems, cardinality estimators provide a fast framework for keeping valuable application data easily queryable and maintainable. Applications are numerous. A common use case is maintaining a number of counters that track application statistics for real-time dashboards. Another is estimating the size of large sets within the internal operations of big data systems, such as counting. This paper addresses the problem of scaling the computation of a cardinality estimator in a distributed setting in the presence of node failures. Moreover, the proposed estimation technique is proved to be eventually consistent, which is adequate for most distributed applications. To the best of the authors' knowledge, this functionality is not currently provided by commonly used commercial and open source systems. Additionally, the proposed approach is generic enough to be applied to other algorithms, and can thus serve as a basic framework for more complex operations in the big data field. We demonstrate this with graph metric calculations in large-scale biodata mining.
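As an illustration of why eventual consistency is attainable for this class of estimators, the following minimal sketch assumes a HyperLogLog-style estimator whose replicas are merged by taking the register-wise maximum. The abstract does not name the estimator or the replication protocol, so the class `HLLSketch`, its parameter `p`, and the use of SHA-1 as the hash function are all hypothetical choices made for the example. Because the register-wise maximum is commutative, associative, and idempotent, replicas that exchange state in any order, or receive the same state more than once after a node failure and retry, end up with identical registers.

```python
import hashlib
import math


class HLLSketch:
    """Hypothetical HyperLogLog-style sketch; not the paper's implementation."""

    def __init__(self, p=10):
        self.p = p                      # number of index bits (assumed value)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from SHA-1 (an assumption; any good hash works).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise maximum: commutative, associative, idempotent.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HLLSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Three replicas observe overlapping parts of a stream of 5000 distinct items.
a, b, c = HLLSketch(), HLLSketch(), HLLSketch()
for i in range(0, 3000):
    a.add(i)
for i in range(2000, 4000):
    b.add(i)
for i in range(3500, 5000):
    c.add(i)

left = a.merge(b).merge(c)      # one merge order
right = c.merge(b).merge(a)     # a different merge order
again = left.merge(a)           # re-delivery of an already merged state
```

Under these assumptions, `left`, `right`, and `again` hold the same registers and therefore report the same estimate, which is the convergence property that eventual consistency requires. The sketch leaves out failure detection and the transport used to exchange state between nodes.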
