Statistics, data mining, and machine learning in astronomy: a practical Python guide for the analysis of survey data, by Željko Ivezić, Andrew J. Connolly, Jacob T. VanderPlas and Alexander Gray
Statistics is the science of collecting, analysing and interpreting data; although a science in its own right, it is usually treated as a branch of mathematics. In statistics, data-sets tend to follow distributions of a characteristic shape, the most common of which is the bell-shaped curve, also known as the bell curve, the normal distribution or the Gaussian. The binomial, χ² and Poisson distributions, among others, are ones that readers are already familiar with or have at least heard of; but what of data mining and machine learning?

Data mining and machine learning can be defined in many ways. Here, the authors describe data mining as a set of techniques for analysing and describing structured data, and machine learning as a set of techniques for interpreting data by comparing them to models of data behaviour. In essence, both are about looking for patterns and relationships in data. Perhaps an easier way to look at it is through readers' own experience with supermarkets and social media. Through its bonus-point programme, a supermarket accumulates information on its customers' shopping habits and interests. Ever noticed that when readers stop buying something, say dog food, emails (or even snail mail) soon arrive enticing them to go and buy it again ('Sales now on!' etc., never mind that the dog has been put to sleep)? That is all data mining, popular with customer relationship management teams. Or take any kind of News Feed that readers get on Facebook. How does Facebook know that readers want to see the cat video first thing in the morning? Facebook's algorithm essentially learns (yes, on its own), via statistical analysis, what readers like (and 'Like'), and bingo! The cat video is at the top of the Feed. The fact that the algorithm can do this without being explicitly told is machine learning, a form of artificial intelligence.

What is this book about?

One half of this book is about statistics. The other is about using statistics in astronomy and astrophysics.
As readers already know, scientists, astronomers included, are not clueless when it comes to statistics; they have been using it for ages. If that is so, why do they need this book? The truth is that the field of statistics has evolved by leaps and bounds, leaving astronomers stuck in the same old rut, using the same old techniques, which are proving more and more cumbersome each day in dealing with data volumes that easily run to terabytes. (It seems astronomers no longer count data by number but by (computer) memory instead.) In short, the book is about teaching astronomers the art of analysing very large data-sets by introducing more efficient ways of using statistics. It is a practical guide to data mining using tools such as unsupervised classification, clustering, principal component analysis, locally linear embedding and the somewhat exotic-sounding projection pursuit; and to machine learning using methods such as Bayesian inference, supervised classification, maximum likelihood estimation and regression. (Supervised simply means that something about the data, such as the class of some of the objects, is already known, while unsupervised means that nothing is.) The combination of the three subjects is certainly special, as readers would usually have to get a different text for each.

The way to read this book is by doing, i.e. by running the computer code to reproduce the graphs and figures as readers go through it, so that the pros and cons of any particular method can be studied. To make the practice accessible to all, the authors have used AstroML¹, which is based on Python², in their sample analyses. Python is an open-source, object-oriented language that is quickly gaining popularity in the scientific community, not least because it is free and hence available to everybody; AstroML is a Python module for data mining and machine learning. Nothing needs to be done other than installing Python and the AstroML package to start the ball rolling.
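To make the supervised/unsupervised distinction mentioned above a little more concrete, here is a minimal sketch in plain Python. It is not taken from the book and does not use AstroML. The two synthetic "populations", the function names and the choice of a simple two-cluster k-means are all assumptions made purely for illustration.

```python
import random
from statistics import mean


def make_data(seed=0, n=200):
    """Draw n points from each of two well-separated 1-D Gaussians (synthetic data)."""
    rng = random.Random(seed)
    population_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    population_b = [rng.gauss(6.0, 1.0) for _ in range(n)]
    values = population_a + population_b
    labels = [0] * n + [1] * n  # the "already known" information
    return values, labels


def fit_supervised(values, labels):
    """Supervised: the labels are known, so each class centre is just a class mean."""
    return [mean(v for v, lab in zip(values, labels) if lab == k) for k in (0, 1)]


def fit_unsupervised(values, n_iter=20):
    """Unsupervised: no labels, so find two centres with a basic k-means loop."""
    centres = [min(values), max(values)]  # start from the two extremes
    for _ in range(n_iter):
        groups = ([], [])
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        centres = [mean(g) for g in groups]
    return centres


def predict(value, centres):
    """Assign a value to the index of the nearest centre."""
    return min(range(len(centres)), key=lambda k: abs(value - centres[k]))


values, labels = make_data()
supervised_centres = fit_supervised(values, labels)
unsupervised_centres = fit_unsupervised(values)
print("supervised centres:  ", supervised_centres)
print("unsupervised centres:", unsupervised_centres)
```

On data this cleanly separated, both approaches should recover roughly the same two centres; the difference is that only the supervised fit was told which point belongs to which population. Real survey data are rarely so obliging, which is where the methods the book covers come in.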
For the purpose of this book, data-sets from the Sloan Digital Sky Survey are used. There is no need to worry about their large volume either, as the data are maintained on GitHub,³ the online repository. Of course, readers can also choose to use their own data and modify the code as they see fit.

Who is this book for?

Obviously, it is written for those already working in fields of astronomy and astrophysics that collect data by the millions, such as extragalactic astronomy, exoplanets, etc. Its value can only be appreciated by researchers looking for ways of interpreting the astronomical amount of astronomical data (OK, pun intended) by way of statistics.