CHAPTER 5: Statistics, Data Mining and Modeling

Once mass spectrometry data have been pre-processed to identify the true peaks, they can be used in different ways. For instance, one may want to build a predictive model that differentiates between two conditions (e.g. case versus control) and classifies new samples, or to discover differentially expressed molecules. To accomplish these and other tasks, appropriate statistical and data mining techniques must be chosen and applied. With these goals in mind, this chapter presents strategies for sample comparison, dimensionality reduction techniques (e.g. Principal Component Analysis), cluster analysis methods (e.g. hierarchical clustering), approaches for identifying important variables (e.g. for biomarker discovery), and the creation and evaluation of predictive models based on machine learning techniques. These topics are covered in a practical way, with reusable examples based on real, publicly available mass spectrometry datasets.