RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints

Searching for a query molecule in massive molecular repositories is a fundamental task in chemoinformatics and drug discovery. Chemical fingerprints are commonly used to characterize the structure and properties of molecules. Some fingerprints, particularly unfolded fingerprints, are often extremely high-dimensional and sparse, with only a few features taking a positive value. In this work, we propose a new search algorithm, RISC, which exploits the sparsity of high-dimensional fingerprints to derive effective pruning mechanisms and dramatically speed up search. RISC works on both binary and non-binary chemical fingerprints. Extensive experiments on range queries and top-k queries across several molecular repositories demonstrate that for fingerprints of dimension 2048 and above, which is often the case with unfolded fingerprints, RISC is consistently faster than state-of-the-art techniques. The source code of our implementation is available at http://www.cse.iitd.ac.in/~sayan/software.html .
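
To illustrate the general idea of inverted-index search over sparse fingerprints, the following is a minimal sketch and not the RISC algorithm itself. It assumes binary fingerprints stored as sets of "on" feature ids, Tanimoto similarity as the measure, and a range query with a threshold strictly greater than 0. The function names (`build_index`, `range_query`) and the toy molecules are hypothetical. The only pruning shown is the standard size bound Tanimoto(A, B) <= min(|A|, |B|) / max(|A|, |B|); the pruning mechanisms used by RISC are not reproduced here.

```python
from collections import defaultdict


def build_index(fingerprints):
    """Map each 'on' feature id to the ids of the molecules that contain it."""
    index = defaultdict(list)
    for mol_id, features in fingerprints.items():
        for f in features:
            index[f].append(mol_id)
    return index


def range_query(query, fingerprints, index, threshold):
    """Return {mol_id: tanimoto} for molecules with Tanimoto >= threshold.

    Assumes threshold > 0: molecules sharing no feature with the query
    are never visited, which is only safe for a positive threshold.
    """
    # Count shared features by scanning only the posting lists of the
    # query's own features; all other dimensions are never touched.
    shared = defaultdict(int)
    for f in query:
        for mol_id in index.get(f, ()):
            shared[mol_id] += 1

    q = len(query)
    results = {}
    for mol_id, c in shared.items():
        m = len(fingerprints[mol_id])
        # Size bound: Tanimoto <= min(q, m) / max(q, m), so a candidate
        # whose size differs too much from the query can be skipped.
        if min(q, m) < threshold * max(q, m):
            continue
        sim = c / (q + m - c)
        if sim >= threshold:
            results[mol_id] = sim
    return results


# Toy repository (hypothetical data): feature ids of the positive bits.
fingerprints = {
    "mol_a": {3, 17, 1024, 40961},
    "mol_b": {3, 17, 1024, 99999},
    "mol_c": {5, 8},
    "mol_d": {3, 17, 1024, 40961, 7, 9, 11, 13, 15, 19},
}
index = build_index(fingerprints)
hits = range_query({3, 17, 1024, 40961}, fingerprints, index, 0.5)
print(hits)
```

In this toy run, `mol_c` is never visited because it shares no feature with the query, and `mol_d` is discarded by the size bound before its similarity is computed. The work done per query depends on the lengths of the posting lists touched rather than on the fingerprint dimension, which is why sparsity helps.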