Discovery of Power-Laws in Chemical Space

Power-law distributions have been observed in a wide variety of areas. To our knowledge, however, there has been no systematic observation of power-law distributions in chemoinformatics. Here, we present several examples of power-law distributions arising from the features of small organic molecules. The distributions of rigid segments and ring systems, the distributions of molecular paths and circular substructures, and the sizes of molecular similarity clusters all show linear trends on log-log rank/frequency plots, suggesting underlying power-law distributions. The number of unique features also follows Heaps'-like laws. The characteristic exponents of the power-laws lie in the 1.5 to 3 range, consistent with the exponents observed in other power-law phenomena. The power-law nature of these distributions leads to several applications, including predicting the growth of available data through Heaps' law and optimally allocating experimental or computational resources via the 80/20 rule. More importantly, we also show how the power-laws can be leveraged to compress chemical fingerprints efficiently and losslessly, which improves the storage and retrieval of molecules in large chemical databases.
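The log-log rank/frequency diagnostic mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the feature names, the synthetic counts, and the helper functions `rank_frequency` and `loglog_slope` are all hypothetical, and an ordinary least-squares fit on a log-log plot is only a rough check, since it can give biased exponent estimates on real data.

```python
import math
from collections import Counter


def rank_frequency(features):
    """Return feature counts sorted in decreasing order (rank 1 first)."""
    return sorted(Counter(features).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic stand-in for molecular features (an assumption, not real data):
# the k-th most common feature occurs about 1000 / k times.
features = []
for k in range(1, 51):
    features.extend([f"feat{k}"] * round(1000 / k))

freqs = rank_frequency(features)
slope = loglog_slope(freqs)
print(f"fitted log-log slope: {slope:.3f}")
```

On this synthetic input the fitted slope comes out close to -1, which is the straight-line trend a rank/frequency plot shows when frequency is inversely proportional to rank. With real features such as paths or ring systems, the same two functions would be applied to the list of feature occurrences extracted from the molecules.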
