User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries.

Directed evolution of proteins depends on the production of molecular diversity by random mutagenesis. While a number of methods have been developed for introducing this diversity, the best ways to sample it are not always clear. Here we used simple statistics to analyse completeness and diversity in randomized libraries generated by oligonucleotide-directed mutagenesis, error-prone polymerase chain reaction (epPCR) and in vitro recombination of highly homologous sequences. For oligonucleotide-directed mutagenesis, we derive equations to estimate how complete a given library is expected to be and also to predict the size of library required to give a fixed probability of being 100% complete. We describe the statistical bases for computer programs which estimate the number of distinct variants represented in epPCR and shuffled libraries, dubbed PEDEL and DRIVeR, respectively. These programs allow the user to calculate (rather than guess) the diversity represented in a given library and also provide empirical guidelines for maximizing this diversity. PEDEL and DRIVeR are available at www.bio.cam.ac.uk/~blackburn/stats.html.
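
The kind of completeness estimate described above for oligonucleotide-directed mutagenesis can be illustrated with a minimal sketch. It assumes that all V possible variants are equally likely and that the library of L transformants is sampled independently, so that the number of copies of each variant is approximately Poisson with mean L/V. The abstract does not state the equations, so the formulas below are the standard Poisson approximations and are not necessarily the exact expressions derived in the paper. The function names and the example of three NNK-randomized codons are illustrative assumptions.

```python
import math


def expected_completeness(library_size, num_variants):
    """Expected fraction of all possible variants present in the library.

    Assumes equiprobable variants and Poisson sampling (an assumption,
    not taken from the source text).
    """
    return 1.0 - math.exp(-library_size / num_variants)


def expected_distinct(library_size, num_variants):
    """Expected number of distinct variants represented in the library."""
    return num_variants * expected_completeness(library_size, num_variants)


def prob_fully_complete(library_size, num_variants):
    """Approximate probability that every possible variant is present."""
    # Expected number of missing variants under the Poisson approximation.
    expected_missing = num_variants * math.exp(-library_size / num_variants)
    return math.exp(-expected_missing)


def size_for_full_completeness(prob, num_variants):
    """Library size giving probability `prob` of 100% completeness.

    Obtained by inverting prob_fully_complete for library_size.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    return -num_variants * math.log(-math.log(prob) / num_variants)


if __name__ == "__main__":
    # Hypothetical example: three codons randomized with NNK (32 codons each).
    v = 32 ** 3
    for size in (v, 3 * v, 10 * v):
        frac = expected_completeness(size, v)
        print(f"L = {size:>7d}: expected completeness = {frac:.3f}")
    needed = size_for_full_completeness(0.95, v)
    print(f"L for 95% chance of a fully complete library: {needed:.0f}")
```

Under these assumptions, a library equal in size to the number of possible variants is expected to contain only about 63% of them (1 - 1/e), which shows why a several-fold oversampling is needed before full completeness becomes likely. Counting at the codon level, as in the example, ignores codon degeneracy and stop codons.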