DART: distributed adaptive radix tree for efficient affix-based keyword search on HPC systems

Affix-based search is a fundamental functionality for storage systems. It allows users to find desired datasets whose attributes match a given affix. While building an inverted index to facilitate efficient affix-based keyword search is common practice for standalone databases and desktop file systems, building local indexes or adopting indexing techniques designed for a standalone data store is insufficient for high-performance computing (HPC) systems, because of the massive amount of data and the distributed nature of the storage devices within such a system. In this paper, we propose the Distributed Adaptive Radix Tree (DART) to address the challenge of distributed affix-based keyword search on HPC systems. This trie-based approach scales to provide efficient affix-based search while alleviating imbalanced keyword distribution and excessive requests on popular keywords. Our evaluation at different scales shows that, compared with the "full string hashing" use case of the most popular distributed indexing technique, the Distributed Hash Table (DHT), DART achieves up to 55× better throughput for prefix and suffix searches, while achieving comparable throughput for exact and infix searches. In addition, compared with the "initial hashing" use case of DHT, DART maintains a balanced keyword distribution across distributed nodes and alleviates excessive query workload against popular keywords.
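To make the two DHT baselines concrete, the following Python sketch contrasts them on a toy keyword set. It is an illustration only and does not implement DART. It assumes that "full string hashing" places a keyword by hashing the whole string and that "initial hashing" places it by hashing only its first character; the abstract does not define either term. The server count, keyword list, and all function names are hypothetical.

```python
import hashlib
from collections import Counter

NUM_SERVERS = 4  # hypothetical cluster size


def stable_hash(text: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def full_string_server(keyword: str) -> int:
    """Assumed 'full string hashing': the whole keyword decides the server."""
    return stable_hash(keyword) % NUM_SERVERS


def initial_server(keyword: str) -> int:
    """Assumed 'initial hashing': only the first character decides the server."""
    return stable_hash(keyword[0]) % NUM_SERVERS


def servers_for_prefix_query(prefix: str, scheme: str) -> set:
    """Servers that must be contacted to answer a prefix query."""
    if scheme == "initial":
        # Every keyword sharing the prefix also shares its first character,
        # so a single server holds all possible matches.
        return {initial_server(prefix)}
    # With full string hashing, matching keywords can live anywhere,
    # so the query has to be broadcast to every server.
    return set(range(NUM_SERVERS))


# Hypothetical attribute keywords, skewed toward the letter "t".
keywords = ["temperature", "temp_max", "temp_min", "time", "pressure", "velocity"]

load_full = Counter(full_string_server(k) for k in keywords)
load_initial = Counter(initial_server(k) for k in keywords)

print("prefix 'temp', full string hashing:", servers_for_prefix_query("temp", "full"))
print("prefix 'temp', initial hashing:   ", servers_for_prefix_query("temp", "initial"))
print("keywords per server, full string hashing:", dict(load_full))
print("keywords per server, initial hashing:   ", dict(load_initial))
```

Under these assumptions, the sketch shows the trade-off the abstract describes: hashing the full string forces prefix queries to be broadcast, while hashing only the initial character answers them from one server but concentrates keywords with a common first letter on that server.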
