Paper information - D$^2$HistoSketch: Discriminative and Dynamic Similarity-Preserving Sketching of Streaming Histograms

Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is challenging for streaming histograms, where the elements of a histogram are observed one after another in an online manner. The ever-growing cardinality of histogram elements over the data streams makes any similarity computation inefficient in that case. To tackle this problem, we propose in this paper D$^2$HistoSketch, a similarity-preserving sketching method for streaming histograms that efficiently approximates their Discriminative and Dynamic similarity. D$^2$HistoSketch quickly and memory-efficiently maintains a set of compact, fixed-size sketches of streaming histograms to approximate the similarity between histograms. To provide high-quality similarity approximations, D$^2$HistoSketch considers both discriminative and gradual forgetting weights for similarity measurement, and seamlessly incorporates them in the sketches. Based on both synthetic and real-world datasets, our empirical evaluation shows that our method efficiently and effectively approximates the similarity between streaming histograms while outperforming state-of-the-art sketching methods. In particular, compared to full streaming histograms with both discriminative and gradual forgetting weights, D$^2$HistoSketch dramatically reduces the classification time (a 7500x speedup) at the expense of only a small loss in accuracy (about 3.25 percent).
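
To make the idea of a fixed-size, similarity-preserving sketch more concrete, here is a minimal Python illustration. It is not the paper's algorithm: the abstract does not give the update rules, so the sketch below swaps in a simple weighted sampling scheme (an "exponential race" over hashed elements) and assumes that gradual forgetting takes the form of exponential decay. All names (`unit_hash`, `DecayedHistogram`, `sketch`, `estimate_similarity`) and the `decay` and `weight` parameters are hypothetical. The histogram here is also kept exactly, so it grows with the number of distinct elements, whereas the method described above maintains only the fixed-size sketches.

```python
import hashlib
import math
from collections import defaultdict


def unit_hash(element: str, slot: int) -> float:
    """Deterministic pseudo-random value in (0, 1] for an (element, slot) pair."""
    digest = hashlib.sha256(f"{slot}:{element}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / 2**64


class DecayedHistogram:
    """Exact streaming histogram with exponential forgetting (assumed form)."""

    def __init__(self, decay: float = 0.01):
        self.decay = decay
        self.counts = defaultdict(float)

    def add(self, element: str, weight: float = 1.0) -> None:
        # `weight` stands in for a per-element discriminative weight (hypothetical).
        factor = math.exp(-self.decay)
        for key in self.counts:
            self.counts[key] *= factor  # older observations fade
        self.counts[element] += weight


def sketch(hist: dict, size: int) -> list:
    """Pick one element per slot; heavier elements win a slot more often.

    Assumes a non-empty histogram with strictly positive weights.
    """
    result = []
    for slot in range(size):
        best = min(hist, key=lambda e: -math.log(unit_hash(e, slot)) / hist[e])
        result.append(best)
    return result


def estimate_similarity(a: list, b: list) -> float:
    """Fraction of slots on which two equal-length sketches agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two streams with the same history produce identical sketches, and streams with no elements in common never agree on a slot. In between, the fraction of matching slots rises as the weighted histograms overlap more. Comparing two sketches costs time proportional to the sketch size rather than to the number of distinct elements, which is the property that makes this kind of classification speedup plausible.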
