An Introduction to Bioinformatics Algorithms
Chapters 3 and 4 describe the K-means and Ward approaches to clustering, together with some modifications and nuances. Mirkin describes these approaches nicely within his framework, and the two chapters pave the way for Chapter 5, in which the “data recovery” framework is applied to clustering and compared with its application to other familiar statistical models such as regression. Along the way, Mirkin proves a number of interesting points, some on data standardization, many regarding the ANOVA-like decomposition of the “data scatter” into explained (by the clustering) and unexplained parts. These three chapters make up the heart of the book and will reward the reader. The decompositions in particular provide a solid theoretical underpinning for clustering that should take root in the community at large.

The final two chapters deal with more tangential topics. Chapter 6 surveys competing clustering approaches and tries (with varying levels of success) to bring them under the “data recovery” umbrella. Chapter 7 attacks what the author calls “general issues” like missing values and cluster validation.

The book does have several major flaws, however. By its reference to “Data Mining,” the title leads the reader to believe that the book will specifically address the problem of clustering applied to large, heterogeneous datasets, where computational details of implementation should be important. However, there is a near-total lack of description of how the algorithms should actually be implemented. Mirkin provides almost no details on how algorithms should be designed, what the comparative costs of the different approaches might be, how parallel processing might be brought to bear, or other computational issues.

A second flaw concerns the limited discussion of cluster validation. How does the user know that a clustering has “succeeded” in the sense of discovering meaningful clusters, and how does he or she interpret the resulting groups?
These are difficult and context-dependent questions, to be sure, but Mirkin’s final subsection is insufficient to handle this important problem.

Finally, the writing is in need of an overhaul by a native speaker. Mirkin’s sentences are understandable, but they are too often littered with little annoyances like missing articles and obsolete or awkward turns of phrase. The number of outright typographical errors seems small, though not zero; one obvious offender in Chapter 2 features a table whose column headers are simultaneously renamed and inadvertently switched.

All in all, this book is a valuable contribution to the clustering literature but almost entirely from a theoretical perspective. Active data miners who rely on commercial software to do their clustering will find little here to help them. Programmers will find descriptions of techniques, though without explicitly stated efficient computational algorithms. On the other hand, theoreticians and those hoping to bring clustering closer to the more familiar territory of regression and classification should find this book useful.
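
For readers unfamiliar with the ANOVA-like decomposition of the data scatter praised above, the following is a minimal sketch of the standard identity for least-squares clustering: with the data centred at the grand mean, the total scatter equals an explained part (cluster sizes times squared centroid norms) plus an unexplained part (squared distances from entities to their centroids). The review does not reproduce Mirkin's own formulas, so the function name `scatter_decomposition`, the toy data, and the labels here are illustrative assumptions, not the book's notation or code.

```python
def scatter_decomposition(points, labels):
    """Split the data scatter into explained and unexplained parts.

    points: list of equal-length numeric lists (one row per entity).
    labels: list of cluster labels, one per row (hypothetical input format).
    Returns (total, explained, unexplained).
    """
    n = len(points)
    dim = len(points[0])

    # Centre the data at the grand mean so scatter is measured around it.
    grand = [sum(p[j] for p in points) / n for j in range(dim)]
    centred = [[p[j] - grand[j] for j in range(dim)] for p in points]

    # Total data scatter: sum of all squared centred entries.
    total = sum(v * v for row in centred for v in row)

    # Group the centred rows by cluster label.
    clusters = {}
    for row, lab in zip(centred, labels):
        clusters.setdefault(lab, []).append(row)

    explained = 0.0
    unexplained = 0.0
    for rows in clusters.values():
        size = len(rows)
        centroid = [sum(r[j] for r in rows) / size for j in range(dim)]
        # Explained part: cluster size times squared norm of its centroid.
        explained += size * sum(c * c for c in centroid)
        # Unexplained part: squared distances from rows to their centroid.
        unexplained += sum(
            (r[j] - centroid[j]) ** 2 for r in rows for j in range(dim)
        )
    return total, explained, unexplained


# Toy data (an assumption for illustration): two tight, well-separated groups.
points = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
labels = [0, 0, 1, 1]
total, explained, unexplained = scatter_decomposition(points, labels)
print(total, explained, unexplained)
```

On this toy data the explained share `explained / total` is close to 1, which is the sense in which a partition "recovers" the data; the identity `total == explained + unexplained` holds for any partition, good or bad, so it measures fit rather than validating the clusters.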