Background: Extracting relevant biological information from large data sets is a major challenge in functional genomics research. Different aspects of the data hamper their biological interpretation. For instance, 5000-fold differences in concentration between metabolites are present in a metabolomics data set, yet these differences are not proportional to the biological relevance of these metabolites. Data analysis methods, however, are not able to make this distinction. Data pretreatment methods can correct for aspects that hinder the biological interpretation of metabolomics data sets by emphasizing the biological information in the data set, thus improving their biological interpretability.

Results: Different data pretreatment methods, i.e. centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation, were tested on a real-life metabolomics data set. They were found to greatly affect the outcome of the data analysis and thus the rank of the metabolites that are most important from a biological point of view. Furthermore, the stability of the rank, the influence of technical errors on data analysis, and the preference of data analysis methods for selecting highly abundant metabolites were affected by the data pretreatment method used prior to data analysis.

Conclusion: Different pretreatment methods emphasize different aspects of the data, and each pretreatment method has its own merits and drawbacks. The choice of a pretreatment method depends on the biological question to be answered, the properties of the data set, and the data analysis method selected. For the explorative analysis of the validation data set used in this study, autoscaling and range scaling performed better than the other pretreatment methods.
That is, range scaling and autoscaling were able to remove the dependence of the rank of the metabolites on the average concentration and the magnitude of the fold changes and showed biologically sensible results after PCA (principal component analysis). In conclusion, selecting a proper data pretreatment method is an essential step in the analysis of metabolomics data and greatly affects the metabolites that are identified to be the most important.

Published: 08 June 2006
BMC Genomics 2006, 7:142 doi:10.1186/1471-2164-7-142
Received: 20 February 2006
Accepted: 08 June 2006
This article is available from: http://www.biomedcentral.com/1471-2164/7/142
© 2006 van den Berg et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
Functional genomics approaches are increasingly being used for the elucidation of complex biological questions with applications that range from human health [1] to microbial strain improvement [2]. Functional genomics tools have in common that they aim to measure the complete biomolecule response of an organism to the environmental conditions of interest. While transcriptomics and proteomics aim to measure all mRNA and proteins, respectively, metabolomics aims to measure all metabolites [3,4].

In metabolomics research, there are several steps between the sampling of the biological condition under study and the biological interpretation of the results of the data analysis (Figure 1). First, the biological samples are extracted and prepared for analysis.
Subsequently, different data preprocessing steps [3,5] are applied in order to generate 'clean' data in the form of normalized peak areas that reflect the (intracellular) metabolite concentrations. These clean data can be used as the input for data analysis. However, it is important to use an appropriate data pretreatment method before starting data analysis. Data pretreatment methods convert the clean data to a different scale (for instance, a relative or logarithmic scale). In this way, they aim to focus on the relevant (biological) information and to reduce the influence of disturbing factors such as measurement noise. Procedures that can be used for data pretreatment are scaling, centering and transformations.

In this paper, we discuss different properties of metabolomics data, how pretreatment methods influence these properties, and how the effects of the data pretreatment methods can be analyzed. The effect of data pretreatment will be illustrated by the application of eight data pretreatment methods to a metabolomics data set of Pseudomonas putida S12 grown on four different carbon sources.

Properties of metabolome data
In metabolomics experiments, a snapshot of the metabolome is obtained that reflects the cellular state, or phenotype, under the experimental conditions studied [3]. The experiments that resulted in the data set used in this paper were conducted according to an experimental design. In an experimental design, the experimental conditions are purposely chosen to induce variation in the area of interest. The resulting variation in the metabolome is called induced biological variation. However, other factors are also present in metabolomics data:

1. Differences in orders of magnitude between measured metabolite concentrations; for example, the average concentration of a signal molecule is much lower than the average concentration of a highly abundant compound like ATP.
However, from a biological point of view, metabolites present in high concentrations are not necessarily more important than those present at low concentrations.

2. Differences in the fold changes in metabolite concentration due to the induced variation; the concentrations of metabolites in the central metabolism are generally rela

Figure 1. The different steps between biological sampling and ranking of the most important metabolites.
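The pretreatment methods named in the abstract (centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation) can be illustrated with a minimal Python sketch. This section does not give their formulas, so the sketch uses the commonly cited definitions and should be read as an illustration under stated assumptions, not as the authors' implementation. The function name `pretreat`, the method keys, and the toy concentrations are hypothetical. The sketch further assumes that the power transformation is a square root, that both transformations are followed by centering, and that the sample standard deviation is used.

```python
import math
from statistics import mean, stdev


def pretreat(values, method):
    """Pretreat the measurements of ONE metabolite across all samples.

    `values` is a list of concentrations (one per sample); `method` is one of
    "center", "auto", "pareto", "range", "vast", "log", "power".
    """
    # Transformations change the scale first and are then centered (assumption).
    if method == "log":
        values = [math.log10(v) for v in values]  # requires v > 0
        method = "center"
    elif method == "power":
        values = [math.sqrt(v) for v in values]   # requires v >= 0
        method = "center"

    m = mean(values)
    centered = [v - m for v in values]
    if method == "center":
        return centered

    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    divisors = {
        "auto": s,                           # unit standard deviation
        "pareto": math.sqrt(s),              # square root of the standard deviation
        "range": max(values) - min(values),  # observed concentration range
        "vast": s * s / m,                   # autoscaling times mean / sd
    }
    if method not in divisors:
        raise ValueError(f"unknown method: {method}")
    return [c / divisors[method] for c in centered]


# Toy data (hypothetical): two metabolites with the same relative profile,
# but a 5000-fold difference in absolute concentration.
abundant = [5000.0, 10000.0, 15000.0]
scarce = [1.0, 2.0, 3.0]

for name in ("center", "auto", "pareto", "range", "vast", "log", "power"):
    print(name, pretreat(abundant, name), pretreat(scarce, name))
```

In this toy example, centering alone leaves the 5000-fold difference in spread between the two metabolites untouched. Autoscaling, range scaling, vast scaling, and the log transformation map both metabolites onto identical profiles, whereas pareto scaling and the square-root transformation only reduce the difference.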
[1] M. J. van der Werf, et al. Quenching of microbial samples for increased reliability of microarray data. Journal of microbiological methods, 2006.
[2] T. Hankemeier, et al. Microbial metabolomics with gas chromatography/mass spectrometry. Analytical chemistry, 2006.
[3] Peter D. Karp, et al. MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res., 2005.
[4] A. Smilde, et al. Fusion of mass spectrometry-based metabolomics data. Analytical chemistry, 2005.
[5] T. Hankemeier, et al. Microbial metabolomics: replacing trial-and-error by the unbiased selection and ranking of targets. Journal of Industrial Microbiology and Biotechnology, 2005.
[6] W. Matson, et al. Analytical precision, biological variation, and mathematical normalization in high data density metabolomics. Metabolomics, 2005.
[7] Emmanuel Dias-Neto, et al. Large-scale transcriptome analyses reveal new genetic marker candidates of head, neck, and thyroid cancer. Cancer research, 2005.
[8] Age K. Smilde, et al. Analysis of longitudinal metabolomics data. Bioinform., 2004.
[9] T. Ebbels, et al. Improved analysis of multivariate data by variable stability scaling: application to NMR-based metabolic profiling. 2003.
[10] R. Bro, et al. Centering and scaling in component analysis. 2003.
[11] T. Fearn. The Jackknife. 2000.
[12] S. Stein. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. 1999.
[13] D. Botstein, et al. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America, 1998.
[14] J. Visser, et al. Determination of intermediary metabolites in Aspergillus niger. 1996.
[15] H. R. Keller, et al. Evolving factor analysis in the presence of heteroscedastic noise. 1992.
[16] J. Edward Jackson, et al. A User's Guide to Principal Components. 1991.
[17] M. J. van der Werf, et al. Bacterial degradation of styrene involving a novel flavin adenine dinucleotide-dependent styrene monooxygenase. Applied and environmental microbiology, 1990.
[18] W. A. Scheffers, et al. Physiology of Saccharomyces cerevisiae in anaerobic glucose-limited chemostat cultures. Journal of general microbiology, 1990.
[19] D. Cox, et al. An Analysis of Transformations. 1964.
[20] Bart Pieterse, et al. Multivariate analysis of microarray data by principal component discriminant analysis: prioritizing relevant transcripts linked to the degradation of different carbohydrates in Pseudomonas putida S12. Microbiology, 2006.
[21] M. J. van der Werf, et al. Towards replacing closed with open target selection strategies. Trends in biotechnology, 2005.
[22] O. Fiehn. Metabolomics – the link between genotypes and phenotypes. Plant Molecular Biology, 2004.
[23] Hiroyuki Ogata, et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 1999.
[24] Yizeng Liang, et al. Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. 1994.