Metabolomics encompasses the analysis of metabolites using profiling techniques such as mass spectrometry (MS) and nuclear magnetic resonance (NMR). Statistical analysis is performed on the profiled data to determine variations in metabolite levels. The goal is to reveal relationships between variations in metabolite concentrations and specific pathophysiological conditions, such as diseases, or external factors. Metabolomics has been widely used to characterize metabolites in body fluids such as saliva, serum and urine in many fields of medical research, including cancer [3], cardiology [6], diabetes [5], human infections [12], neurology [7], neonatology [4] and respiratory diseases [2], to name a few. The statistical methods used to analyze metabolomics data can be categorized as univariate or multivariate. Univariate methods are very commonly applied because they are easy to use and interpret. They consider metabolomic features (variables) one at a time, independently of each other, and thus ignore correlations with other features. Moreover, as pointed out by Alonso et al. [1], these methods ignore confounding variables such as age, gender and body mass index (BMI), which may lead to incorrect results [13, 15]. Multivariate methods, on the other hand, consider all the features and their correlations during data analysis. They include unsupervised methods such as principal component analysis (PCA) and supervised methods such as partial least squares (PLS) and support vector machines (SVM). Alonso et al. [1] provide a review of the univariate and multivariate methods used in metabolomics. To the best of our knowledge, many state-of-the-art statistical methods have not yet been used for metabolomic data analysis. A significant advantage of these methods over commonly used ones is their ability to process high-dimensional data.
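The confounding problem noted above for univariate methods can be illustrated with a small sketch. The values below are synthetic and purely hypothetical (they are not QMDiab data): a metabolite level and an outcome both increase with age, so a one-variable regression reports an association that disappears once age is regressed out of both. The helper names are illustrative assumptions, not part of any method used in this work.

```python
def mean(xs):
    return sum(xs) / len(xs)


def slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Synthetic example: both variables are driven by age plus unrelated noise.
age = [20, 30, 40, 50, 60, 70]
metabolite = [3, 1, 5, 6, 4, 8]   # 0.1 * age + noise
outcome = [5, 6, 7, 9, 12, 15]    # 0.2 * age + different noise

# Univariate view: the metabolite appears associated with the outcome.
unadjusted = slope(metabolite, outcome)

# Age-adjusted view: regress age out of both variables, then compare residuals.
adjusted = slope(residuals(age, metabolite), residuals(age, outcome))

print(f"unadjusted slope: {unadjusted:.3f}")
print(f"age-adjusted slope: {adjusted:.3f}")
```

In this constructed example the apparent association comes entirely from age, which is the kind of spurious result that covariate adjustment guards against.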
Along with state-of-the-art statistical methods, we used differential network analysis to identify variations at the system level. In this work, we analyzed urine samples from the Qatar Metabolomics Study on Diabetes (QMDiab) to identify potential biomarkers. QMDiab was conducted in 2012 by Hamad Medical Corporation (HMC), Qatar, and Weill Cornell Medical College-Qatar, with approval from the Institutional Review Boards of HMC and Weill Cornell Medical College-Qatar (Research Protocol number 11131/11). Written informed consent was obtained from all participants. The subjects were males and females of Arab and Asian ethnicities, aged 17-81 years. Urine samples were sent to Chenomx Inc., Alberta, Canada, for proton nuclear magnetic resonance (1H NMR) analysis. Although the original study targeted type 2 diabetes, in this paper we also focus on obesity, using BMI as a representative measure of obesity. We used regularization models and differential network analysis. Specifically, we used the elastic net, glinternet, the lasso projection and high-dimensional inference. The elastic net combines L1 and L2 penalties, resulting in a mix of lasso and ridge regression. Glinternet is a group-lasso-based method developed by Lim and Hastie [9]. It learns pairwise interactions between variables in linear regression models that satisfy strong hierarchy, meaning that an interaction is included only if both of its main effects are also included. The lasso projection (lasso proj), or de-sparsified lasso, is a regularization-based method that performs statistical inference for low-dimensional parameters with high-dimensional data [17]. It uses a low-dimensional projection approach to construct confidence intervals for the estimated regression parameters. High-dimensional inference computes p-values and associated confidence intervals for variables in high-dimensional data [10].
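To make the elastic net penalty described above concrete, the following is a minimal sketch of a coordinate-descent solver in plain Python. It assumes the common parameterization in which `lam` sets the overall penalty strength and `alpha` sets the L1/L2 mix (`alpha = 1` gives the lasso, `alpha = 0` gives ridge). These names, the toy data, and the solver itself are illustrative assumptions and not the software used in this study.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha, n_sweeps=200):
    """Minimise (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2).

    X is a list of rows. No intercept is fitted, so the data are assumed to be
    centred (or the model is assumed to pass through the origin).
    """
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    residual = list(y)  # equals y - X.beta while beta is all zeros
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            new_bj = soft_threshold(rho, lam * alpha) / (scale + lam * (1.0 - alpha))
            delta = new_bj - beta[j]
            if delta:
                residual = [r - c * delta for r, c in zip(residual, col)]
                beta[j] = new_bj
    return beta


# Hypothetical toy data: y depends only on the first feature (y = 2 * x1).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 0.0, 2.0, 4.0]

unpenalised = elastic_net(X, y, lam=0.0, alpha=0.5)   # ordinary least squares
mixed = elastic_net(X, y, lam=1.0, alpha=0.5)         # coefficients shrink
heavy_lasso = elastic_net(X, y, lam=10.0, alpha=1.0)  # every coefficient is zeroed
```

With no penalty the solver recovers the least-squares fit, a moderate mixed penalty shrinks the coefficients toward zero, and a strong pure-L1 penalty removes all variables, which is the behaviour that makes these models usable when features outnumber samples.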
Further, we performed differential network analysis to identify variable interactions that differentiate between diabetic and non-diabetic, or obese and lean, subjects. A network is constructed for each group of samples using the mutual information between the variables. We applied the differential network analysis algorithm dGHD, proposed by Ruan et al. [14], to detect interaction patterns that differentiate the two networks. The algorithm uses the Generalised Hamming Distance (GHD) to measure topological differences between the networks and computes their statistical significance. Notably, the proposed methods, which to the best of our knowledge have not yet been applied in this field, identify potential biomarkers that previous studies have proposed in the literature, even in a small dataset. The results for the elastic net, glinternet and the lasso proj are summarized in Table 1. For the diabetes analysis, the significant variables identified include age, betaine, glycolate and glucose, which are well-known biomarkers for diabetes [8, 11]. For the obesity analysis, the significant variables identified include age, dimethylamine, succinate and cis-aconitate, previously identified in [16]. High-dimensional inference identified only age and betaine for the diabetes study. We conclude that state-of-the-art statistical and network analysis methods can be used for metabolomics data analysis on datasets with a limited number of samples. The number of metabolomic features is increasing as technologies advance. The ability of these methods to handle high-dimensional data makes them suitable for settings where the number of samples is smaller than the number of features. These methods can help identify potential biomarkers in future studies.
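The network construction and comparison steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: mutual information is estimated with equal-width binning, and the distance is taken as the mean squared difference of mean-centred edge weights, which is our reading of the GHD. The published dGHD algorithm additionally assesses statistical significance and may weight edges differently; neither is reproduced here. All function names and the toy networks are hypothetical.

```python
from collections import Counter
from math import log


def discretise(values, n_bins=3):
    """Equal-width binning (an assumed choice; other MI estimators exist)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant variables
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def mi_network(variables):
    """Weighted adjacency matrix; `variables` is a list of per-variable value lists."""
    binned = [discretise(v) for v in variables]
    p = len(binned)
    return [[0.0 if i == j else mutual_information(binned[i], binned[j])
             for j in range(p)] for i in range(p)]


def ghd(A, B):
    """Mean squared difference of mean-centred off-diagonal edge weights."""
    n = len(A)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    mean_a = sum(A[i][j] for i, j in pairs) / len(pairs)
    mean_b = sum(B[i][j] for i, j in pairs) / len(pairs)
    return sum(((A[i][j] - mean_a) - (B[i][j] - mean_b)) ** 2
               for i, j in pairs) / len(pairs)


# Hypothetical 3-node networks that each contain one edge, in different places.
A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
distance = ghd(A, B)

# Building a network from data: two identical variables share maximal information.
W = mi_network([[1, 2, 3], [1, 2, 3]])
```

In practice one would call `mi_network` once per group (for example, diabetic and non-diabetic subjects) and pass the two resulting matrices to `ghd`. A distance of zero means the two networks have identical mean-centred edge weights.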
[1] Arnald Alonso, et al. Analytical Methods in Untargeted Metabolomics: State of the Art in 2015. Front. Bioeng. Biotechnol., 2015.
[2] L. Barberini, et al. Metabolomics in paediatric respiratory diseases and bronchiolitis. The Journal of Maternal-Fetal & Neonatal Medicine, 2011.
[3] R. Buják, et al. Metabolomics in urogenital cancer. Bioanalysis, 2011.
[4] L. Barberini, et al. Clinical application of metabolomics in neonatology. The Journal of Maternal-Fetal & Neonatal Medicine, 2012.
[5] Nele Friedrich, et al. Metabolomics in diabetes research. The Journal of Endocrinology, 2012.
[6] Luigi Atzori, et al. Metabolomics as a tool for cardiac research. Nature Reviews Cardiology, 2011.
[7] A. Sinclair, et al. The role of metabolomics in neurological disease. Journal of Neuroimmunology, 2012.
[8] M. Lever, et al. Variability of plasma and urine betaine in diabetes mellitus and its relationship to methionine load test responses: an observational study. Cardiovascular Diabetology, 2012.
[9] T. Hastie, et al. Learning Interactions via Hierarchical Group-Lasso Regularization. Journal of Computational and Graphical Statistics, 2015.
[10] Peter Bühlmann, et al. p-Values for High-Dimensional Regression. arXiv:0811.2177, 2008.
[11] H. Daniel, et al. Glyoxylate, a New Marker Metabolite of Type 2 Diabetes. Journal of Diabetes Research, 2014.
[12] O. Mayboroda, et al. Metabolomic investigations of human infections. Bioanalysis, 2012.
[13] A. Astrup, et al. Standardization of factors that influence human urine metabolomics. Metabolomics, 2011.
[14] Giovanni Montana, et al. Differential analysis of biological networks. BMC Bioinformatics, 2015.
[15] Peter Kraft, et al. Reproducibility of metabolomic profiles among men and women in 2 large cohort studies. Clinical Chemistry, 2013.
[16] M. Waters, et al. Investigating Potential Mechanisms of Obesity by Metabolomics. Journal of Biomedicine & Biotechnology, 2012.
[17] Cun-Hui Zhang, et al. Confidence intervals for low dimensional parameters in high dimensional linear models. arXiv:1110.2563, 2011.