InChIKey is a 27-character compacted (hashed) version of InChI, intended for Internet and database searching and indexing, and is based on an SHA-256 hash of the InChI character string. The first block of InChIKey encodes the molecular skeleton, while the second block represents various kinds of isomerism (stereo, tautomeric, etc.). InChIKey is designed to be a nearly unique substitute for the parent InChI. However, a single InChIKey may occasionally map to two or more InChI strings (a collision). The occurrence of a collision does not in itself compromise the signature, because collision-free hashing is impossible; the only viable approach is to set and maintain a level of collision resistance that is sufficient for typical applications.

We tested, in computational experiments, how well the real-life collision resistance of InChIKey corresponds to the theoretical estimates expected by design. For this purpose, we analyzed the statistical characteristics of InChIKey for datasets of variable size in comparison with the theoretical statistical frequencies. For the relatively short second block, exhaustive direct testing was performed. We computed the numbers of collisions for the stereoisomers of Spongistatin I (using the whole set of 67,108,864 isomers and its subsets) and compared them with theory. For the longer first block, we generated, using custom-made software, InChIKeys for more than 3 × 10^10 chemical structures. The statistical behavior of this block was tested by comparing experimental and theoretical frequencies for the various four-letter sequences that may appear in the first block body.

From the results of our computational experiments we conclude that the observed characteristics of InChIKey collision resistance are in good agreement with theoretical expectations.
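The kind of theoretical estimate referred to above can be illustrated with the standard birthday-problem approximation for an ideal hash. The sketch below is not the authors' software. The function names are hypothetical, and the 37-bit size used for the hashed part of the second block is an assumption not stated in the text above.

```python
import math


def expected_colliding_pairs(n_items: int, hash_bits: int) -> float:
    """Expected number of colliding pairs when n_items distinct inputs are
    mapped uniformly at random onto 2**hash_bits possible hash values."""
    n_values = 2 ** hash_bits
    # Each of the n(n-1)/2 pairs collides with probability 1/n_values.
    return n_items * (n_items - 1) / (2 * n_values)


def probability_of_any_collision(n_items: int, hash_bits: int) -> float:
    """Poisson approximation to the probability of at least one collision."""
    return -math.expm1(-expected_colliding_pairs(n_items, hash_bits))


if __name__ == "__main__":
    n_isomers = 67_108_864       # full Spongistatin I stereoisomer set (2**26)
    assumed_block_bits = 37      # ASSUMPTION: hash bits in the second block
    for n in (n_isomers // 64, n_isomers // 8, n_isomers):
        pairs = expected_colliding_pairs(n, assumed_block_bits)
        print(f"{n:>11,d} isomers: about {pairs:,.1f} expected colliding pairs")
```

Comparing such expected counts with the numbers of collisions actually observed for subsets of increasing size is one way to check whether a hash block behaves like an ideal uniform hash.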