SECIMTools: a suite of metabolomics data analysis tools

Background: Metabolomics has the potential to transform personalized medicine, driven by the rapid development of high-throughput technology for untargeted analysis of metabolites. Open-access, easy-to-use analytic tools that are broadly accessible to the biological community need to be developed. Although the technology used in metabolomics varies, most metabolomics studies produce a set of identified features. Galaxy is an open-access platform that enables scientists at all levels to interact with big data. Galaxy promotes reproducibility by saving histories and enabling the sharing of workflows among scientists.

Results: SECIMTools (SouthEast Center for Integrated Metabolomics) is a set of Python applications that are available both as standalone tools and wrapped for use in Galaxy. The suite includes a comprehensive set of quality control metrics (retention time window evaluation and various peak evaluation tools), visualization techniques (hierarchical cluster heatmap, principal component analysis, modulated modularity clustering), basic statistical analysis methods (partial least squares discriminant analysis, analysis of variance, t-test, Kruskal-Wallis non-parametric test), advanced classification methods (random forest, support vector machines), and advanced variable selection tools (least absolute shrinkage and selection operator (LASSO) and Elastic Net).

Conclusions: SECIMTools leverages the Galaxy platform and enables integrated workflows for metabolomics data analysis, built from building blocks designed for ease of use and interpretability. Standard data formats and a set of utilities allow arbitrary linkages between tools to encourage novel workflow designs. The Galaxy framework enables future integration of metabolomics data with other omics data.
