Diamond: A Storage Architecture for Early Discard in Interactive Search

This paper explores the concept of early discard for interactive search of unindexed data. Processing data inside storage devices using downloaded searchlet code enables Diamond to perform efficient, application-specific filtering of large data collections. Early discard helps users who are looking for "needles in a haystack" by eliminating the bulk of the irrelevant items as early as possible. A searchlet consists of a set of application-generated filters that Diamond uses to determine whether an object may be of interest to the user. The system optimizes the evaluation order of the filters based on run-time measurements of each filter's selectivity and computational cost. Diamond can also dynamically partition computation between the storage devices and the host computer to adjust for changes in hardware and network conditions. Performance numbers show that Diamond dynamically adapts to a query and to run-time system state. An informal user study of an image retrieval application supports our belief that early discard significantly improves the quality of interactive searches.
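The filter-ordering idea described above can be illustrated with a minimal sketch. It is not Diamond's implementation: the names `FilterStats`, `order_filters`, `expected_cost`, and `evaluate` are hypothetical. It assumes that filters pass or drop objects independently of one another and that there are no ordering constraints between them. Under those assumptions, sorting filters by measured mean cost divided by measured drop rate minimizes the expected cost per object.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class FilterStats:
    """A filter plus the run-time measurements used to order it (hypothetical)."""
    name: str
    predicate: Callable[[Any], bool]
    calls: int = 0
    passes: int = 0
    total_cost: float = 0.0  # accumulated evaluation cost, e.g. seconds

    def pass_rate(self) -> float:
        # Fraction of objects that passed; unmeasured filters are assumed to pass all.
        return self.passes / self.calls if self.calls else 1.0

    def mean_cost(self) -> float:
        return self.total_cost / self.calls if self.calls else 0.0

    def rank(self) -> float:
        # Cost paid per unit of discard probability; lower should run earlier.
        drop = 1.0 - self.pass_rate()
        return self.mean_cost() / drop if drop > 0.0 else float("inf")


def order_filters(filters: List[FilterStats]) -> List[FilterStats]:
    """Sort by rank; assumes independent filters with no ordering constraints."""
    return sorted(filters, key=lambda f: f.rank())


def expected_cost(filters: List[FilterStats]) -> float:
    """Expected per-object cost of running the filters in the given order."""
    cost, reach = 0.0, 1.0  # reach = probability an object gets this far
    for f in filters:
        cost += reach * f.mean_cost()
        reach *= f.pass_rate()
    return cost


def evaluate(obj: Any, filters: List[FilterStats]) -> bool:
    """Run filters in order, discard on first failure, and update measurements."""
    for f in filters:
        start = time.perf_counter()
        ok = f.predicate(obj)
        f.total_cost += time.perf_counter() - start
        f.calls += 1
        f.passes += int(ok)
        if not ok:
            return False  # early discard: later filters never run
    return True
```

In use, a search loop would call `evaluate` on each object and periodically re-sort with `order_filters` as the measurements change, so that cheap, highly selective filters migrate toward the front.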
