Stekel, D. Microarray bioinformatics

Microarray technology offers biologists the chance to measure the expression levels of tens of thousands of mRNA species simultaneously, by quantifying fluorescence levels of dye‐labelled mRNAs bound to their complementary targets on a glass slide. The design, execution and analysis of microarray experiments require a wide range of practical, computing and statistical knowledge. Gaining the necessary background information from the primary literature would be time‐consuming, hence the availability of several textbooks in this field. According to the back cover, Dov Stekel’s book sets out to be a ‘comprehensive guide to all of the mathematics, statistics and computing you will need to successfully operate DNA microarray experiments’. This is not a text designed to provide any practical information for the laboratory‐based researcher: it focuses instead on how to design arrays and array experiments and how to capture and analyse the resulting data. The first chapter is an excellent overview of alternative array platforms which provides the reader with enough impartial information to select the most appropriate and cost‐effective platform for their needs and to understand the strengths and weaknesses of their chosen system when it comes to data analysis. Chapters two and three concentrate on array design by introducing the concept of a unigene set and providing pointers to several databases that host gene cluster information before dealing with the selection of oligonucleotide probes. Chapters four and five deal with capturing and normalizing array data, including image processing and data normalization. Chapter six deals with measuring and understanding experimental variability, which proves to be the key to a great deal of the statistical analysis and experimental design issues which appear in subsequent chapters.
Chapters seven to nine focus on statistical analysis and break this down into (1) the detection of differentially expressed genes, (2) clustering genes according to their expression profile, and (3) classifying samples by gene expression. The book rounds off by dealing with experimental design (this has to be placed at the end because we need to understand the statistics of the preceding chapters to fully appreciate it) and finally introduces data storage standards. The text throughout is clearly written in a pleasingly succinct style, and never falls into the common trap of ‘preaching’ statistics. The text is liberally punctuated with useful graphs, tables and illustrations that help to summarize data and convey key concepts. Whilst all of the chapters are well‐written and provide a great deal of useful information, those covering array design fall well short of providing enough information to allow a biologist to create a custom array of more than a few hundred features. There is, for example, no mention of the sequence clustering tools that would be required to make a custom array using in‐house sequences or any suggestion as to how one could parse the pre‐built cluster information held at databases such as TIGR. Similarly, the chapter on oligonucleotide probe design discusses the theory behind melting temperature prediction but does not provide the reader with any pointers to high‐throughput design programs such as Primer3. In reality, designing a large custom array requires programming skills that would be beyond the scope of any book on microarrays, so this is more of a practical limitation than a criticism of the text. It is in the chapters on statistical analysis that Dov Stekel really comes into his own, effortlessly explaining the parametric and non‐parametric statistical tools that can be used to find significantly differentially expressed genes, and using brilliant analogies to convey concepts such as dimensionality reduction.
For a biologist with a basic working knowledge of statistics the amount of information provided is absolutely spot‐on throughout the text. For example, we are given the formula for the t‐test, because one could easily apply this by hand to some test data, but we are spared the details of ANOVA because we will all use a statistical package or programming module to implement it. Rather than simply present the details of several tests, the author always makes the effort to compare the alternative approaches in plain English such that readers will be able to decide for themselves which tests are appropriate. The author clearly understands that this is the key to using statistics for a non‐expert. Most users of microarrays will already use one of the popular gene expression analysis packages for their data analysis. This book covers all of the popular techniques used by these packages, including the t‐test, ANOVA, PCA, K‐means clustering and so on, and will almost certainly provide such users with better explanations of the choice and use of these techniques than they will get from their program documentation. It will also educate users in the basic issues such as the benefits of log‐transforming data and the importance of applying multiple test correction to the parallel significance testing of thousands of genes. Additionally, this text will encourage researchers to employ techniques not present in existing commercial packages, such as parametric bootstrapping of gene clusters and the use of neural networks and genetic algorithms for sample classification. Using these advanced techniques means getting to grips with a statistical programming environment such as ‘R’, but this book motivates the reader to make just such a progression. This is a book that will enlighten and inspire the biologist rather than confuse or intimidate them. I wholeheartedly recommend it to anybody with an interest in array analysis.