D$^2$HistoSketch: Discriminative and Dynamic Similarity-Preserving Sketching of Streaming Histograms

Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is challenging for streaming histograms, where the elements of a histogram are observed one after another in an online manner. The ever-growing cardinality of histogram elements over the data streams makes any similarity computation inefficient in that case. To tackle this problem, we propose D$^2$HistoSketch, a similarity-preserving sketching method for streaming histograms that efficiently approximates their Discriminative and Dynamic similarity. D$^2$HistoSketch maintains, quickly and with little memory, a set of compact, fixed-size sketches of streaming histograms to approximate the similarity between histograms. To provide high-quality similarity approximations, D$^2$HistoSketch considers both discriminative and gradual forgetting weights for similarity measurement, and seamlessly incorporates them in the sketches. Based on both synthetic and real-world datasets, our empirical evaluation shows that our method efficiently and effectively approximates the similarity between streaming histograms while outperforming state-of-the-art sketching methods. In particular, compared to full streaming histograms with both discriminative and gradual forgetting weights, D$^2$HistoSketch dramatically reduces the classification time (a 7500x speedup) at the expense of only a small loss in accuracy (about 3.25 percent).
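
To make the problem setting concrete, the following is a minimal sketch of the exact (unsketched) computation that a similarity-preserving sketch would approximate. It is not the method proposed here. It assumes exponential decay as the gradual forgetting scheme and a min-max (generalized Jaccard) similarity between weighted histograms; the names `DecayedHistogram`, `decay`, `weight`, and `minmax_similarity` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


class DecayedHistogram:
    """Exact streaming histogram with exponential forgetting (illustrative only)."""

    def __init__(self, decay=0.1):
        self.decay = decay                # assumed forgetting rate per arriving element
        self.counts = defaultdict(float)  # element -> decayed, weighted frequency

    def add(self, element, weight=1.0):
        # Down-weight everything seen so far, then add the new element.
        # `weight` stands in for a per-element discriminative weight (assumption).
        factor = math.exp(-self.decay)
        for key in self.counts:
            self.counts[key] *= factor
        self.counts[element] += weight


def minmax_similarity(a, b):
    """Generalized Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 0.0


h1, h2 = DecayedHistogram(decay=0.0), DecayedHistogram(decay=0.0)
for e in ["a", "a", "b"]:
    h1.add(e)
for e in ["a", "b", "b"]:
    h2.add(e)
similarity = minmax_similarity(h1.counts, h2.counts)
```

Both the update and the similarity computation touch every distinct element, so their cost and memory grow with the cardinality of the stream. That growth is the inefficiency that compact, fixed-size sketches are meant to avoid.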