Content Sifting Storage: Achieving Fast Read for Large-scale Image Dataset Analysis

Analyzing a large-scale image dataset normally requires reading every image from disk first, which leads to high read latency. We therefore propose a Content Sifting Storage (CSS) system that reduces read latency by reading only the data sifted as relevant to the analysis. CSS generates embedded content metadata via deep learning and manages that metadata with a Semantic Hamming Graph, which enables fast reads by retrieving only the images whose content is similar to what the given analysis requires. Extensive experiments on image datasets show that, compared with conventional semantic storage systems, CSS reduces read latency by 82.21% to 94.8% while keeping the recall rate above 98%.
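The sifting idea can be illustrated with a minimal sketch. It assumes the content metadata takes the form of short binary hash codes and that relevance is decided by a Hamming-distance threshold. The function names, file paths, 8-bit codes, and the radius value are hypothetical, and a plain linear scan stands in for the Semantic Hamming Graph, whose construction is not described here.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def sift(codes: Dict[str, int], query: int, radius: int) -> List[str]:
    """Return paths whose code lies within `radius` bits of the query code.

    Assumption: a linear scan replaces the graph index for illustration only.
    """
    return sorted(p for p, c in codes.items() if hamming(c, query) <= radius)


# Hypothetical metadata: image path -> 8-bit content hash code.
codes = {
    "img/cat_01.jpg": 0b10110010,
    "img/cat_02.jpg": 0b10110110,
    "img/car_01.jpg": 0b01001101,
}

# Only the sifted paths would then be read from disk for analysis.
relevant = sift(codes, query=0b10110011, radius=2)
print(relevant)
```

In this toy setting only the two images whose codes lie within two bits of the query are selected, so the third image is never read from disk.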
