Searching Protein 3-D Structures in Linear Time

One of the most important issues in post-genomic molecular biology is the analysis of protein three-dimensional (3-D) structures, and searching 3-D structure databases is becoming increasingly important. The root mean square deviation (RMSD) is the most popular similarity measure for comparing two molecular structures. In this article, we propose theoretically and practically fast algorithms for the following basic problem: given a structure database of chain molecules (such as proteins), find all substructures whose RMSDs to the query are within a given constant threshold. The best-known worst-case time complexity for this problem is O(N log m), where N is the database size and m is the query size, and the previous best-known expected time complexity is also O(N log m). We propose a new linear-expected-time algorithm. It is not only a theoretically significant improvement over previous algorithms, but also faster in practice, according to computational experiments. Our experiments over the whole Protein Data Bank (PDB) database show that, when searching for similar substructures whose RMSDs are within 1 Å of queries of ordinary lengths, our algorithm is 3.6 to 28 times faster than previously known algorithms. We also propose a series of preprocessing algorithms that enable faster queries, although no previously known indexing algorithm has a query time complexity better than the O(N log m) bound above. One is an O(N log² N)-time and O(N log N)-space preprocessing algorithm with an expected query time complexity of O(m + N/√m). Another is an O(N log N)-time and O(N)-space preprocessing algorithm with an expected query time complexity of O(N/√m + m log(N/m)).
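To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not the algorithm proposed in the article. It assumes each chain is given as an array of 3-D coordinates (for example, one point per residue), computes the RMSD after optimal rigid superposition with a Kabsch-style SVD, and slides the query along a single chain in O(Nm) time. The function names `rmsd` and `find_similar_substructures` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rmsd(P, Q):
    """Minimum RMSD between two (m, 3) point sets over all rotations and translations."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove the translational component by centering both point sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Singular values of the 3x3 covariance matrix give the best achievable overlap.
    U, S, Vt = np.linalg.svd(P.T @ Q)
    # Exclude improper rotations (reflections) by flipping the smallest singular value.
    if np.linalg.det(U @ Vt) < 0:
        S[-1] = -S[-1]
    msd = (np.sum(P * P) + np.sum(Q * Q) - 2.0 * S.sum()) / len(P)
    return float(np.sqrt(max(msd, 0.0)))  # clamp tiny negative rounding errors


def find_similar_substructures(chain, query, threshold):
    """Return start indices i where RMSD(chain[i:i+m], query) <= threshold (naive O(N m) scan)."""
    chain = np.asarray(chain, dtype=float)
    query = np.asarray(query, dtype=float)
    m = len(query)
    return [i for i in range(len(chain) - m + 1)
            if rmsd(chain[i:i + m], query) <= threshold]
```

In a database setting, the same scan would be repeated over every chain, with `threshold` set to the desired RMSD bound (for example, 1.0 for coordinates in Å). The scan recomputes a full superposition for every window, which is the per-position cost that faster methods for this problem aim to reduce.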
