A model-based approach for mining membrane protein crystallization trials

MOTIVATION: Membrane proteins play crucial roles in many cellular functions. Their function can be inferred from their structure, but knowledge of these proteins is limited because their structures are difficult to obtain. Crystallization is an essential step in determining macromolecular structure, and it is also the bottleneck: the process is complex and extremely sensitive to experimental conditions, which are chosen largely by trial and error. Even under the best conditions, obtaining diffraction-quality crystals can take from weeks to years. Further difficulties are the time and cost of running multiple trials and the scarcity of positive samples in a wide and largely undetermined parameter space. Any help in directing scientists' attention to the hot spots in the conceptual crystallization space would therefore make crystallization trials more efficient.

RESULTS: This work is an application case study on mining membrane protein crystallization trials to predict novel conditions that have a high likelihood of leading to crystallization. We use supervised learning algorithms to model the data space and predict a novel set of crystallization conditions. Our preliminary wet-laboratory results are encouraging and suggest that the approach is promising. We conclude with a view of the crystallization space based on our results, which should prove useful for future studies in this area.
