Probabilistic Approaches in Activity Prediction

Biological activity is probabilistic in nature, and the most appropriate approaches to activity prediction are based on the theory of probability. The statistical nature of the maximum likelihood method and the Bayesian approach is well recognized, but many other methods (multiple regression, factor analysis, and pattern recognition methods such as linear discriminant analysis, the linear learning machine, support vector machines, etc.) 1-3 can also be considered probabilistic. 4,5 A search of PubMed Central with the queries “(probabilistic approach) OR (probabilistic method)” or “(statistical approach) OR (statistical method)” returns 3,477 or 180,475 documents, respectively. It is impossible to analyze all these publications, particularly since many of them are not really probabilistic despite the presence of the term in their titles (see, for instance, refs 6-20). We propose the following definition of probabilistic approaches: “The methods that use probabilities as an essential part of the algorithm, and/or for which the results of application are presented as probability estimates”. Approaches that do not strictly meet this definition are therefore not considered in this chapter.
Since data on general dose-response relationships are unavailable in many cases, biological activity is often represented by a single quantitative or even qualitative characteristic, and many training sets are built from activity data presented in this form. Probabilistic ligand-based drug design methods trained on such sets are then used for virtual screening. Existing training sets are not ideal, not only because of this simplified definition of biological activity, but also because (i) no activity is represented by all relevant chemical classes and (ii) no compound has been tested against all kinds of biological activity.
Thus, the probabilistic character of biological activity arises not only from experimental errors in its determination but also from the incompleteness of the available information. Typically, virtual screening methods are used to select hits with a single required activity, 21-24 while the final aim of pharmaceutical R&D is to identify safe and potent leads and drug candidates. 25-28 To overcome this problem, the authors have developed a method for predicting many kinds of biological activity simultaneously from the structural formula of a chemical compound, which is implemented in the computer program PASS (Prediction of Activity Spectra for Substances). 29,30 PASS provides a means of evaluating the general biological activity profile at the early stages of R&D, and its predictions can thus be used as a basis for selecting compounds that have the required kinds of biological activity but lack unwanted ones. 31,32
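The two points made above, that training sets are incomplete and that results can be reported as probability estimates for several activities at once, can be illustrated with a toy sketch. It is not the PASS algorithm, which is not described in this section. The compounds, the descriptors d1 to d3, the activities A and B, the Laplace smoothing, and the averaging rule are all hypothetical choices made only for illustration.

```python
# Hypothetical toy training set: compound -> (structural descriptors, test results).
# An activity absent from the results dict means "not tested", not "inactive".
TRAINING = {
    "cpd1": ({"d1", "d2"}, {"A": True, "B": False}),
    "cpd2": ({"d1"}, {"A": True}),
    "cpd3": ({"d2", "d3"}, {"A": False, "B": True}),
    "cpd4": ({"d3"}, {"B": True}),
}


def coverage(training):
    """Fraction of compound-activity pairs that were actually tested."""
    activities = {a for _, results in training.values() for a in results}
    tested = sum(len(results) for _, results in training.values())
    return tested / (len(training) * len(activities))


def descriptor_probability(training, descriptor, activity):
    """Laplace-smoothed estimate of P(active | descriptor present).

    Only compounds tested for the activity are counted, so untested
    pairs are neither positive nor negative evidence.
    """
    n_tested = 0
    n_active = 0
    for descriptors, results in training.values():
        if descriptor in descriptors and activity in results:
            n_tested += 1
            n_active += int(results[activity])
    return (n_active + 1) / (n_tested + 2)


def predict_spectrum(training, descriptors):
    """Probability estimate for every activity, for one query compound.

    The per-descriptor estimates are simply averaged; this crude rule is
    an assumption of the sketch, not a method taken from the text.
    """
    descriptors = list(descriptors)
    if not descriptors:
        raise ValueError("the query compound needs at least one descriptor")
    activities = sorted({a for _, results in training.values() for a in results})
    return {
        activity: sum(
            descriptor_probability(training, d, activity) for d in descriptors
        ) / len(descriptors)
        for activity in activities
    }


if __name__ == "__main__":
    print("tested fraction:", coverage(TRAINING))
    print("spectrum for a compound with d1:", predict_spectrum(TRAINING, ["d1"]))
```

In this toy set only 6 of the 8 compound-activity pairs have been tested. A descriptor that never occurs in the training set yields an estimate of 0.5 for every activity, which makes the missing information visible rather than hiding it behind an "inactive" label.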