Derivative component analysis for mass spectral serum proteomic profiles

Background: Mass spectrometry-based proteomics is a promising way to transform medicine and has made great progress in identifying disease biomarkers for clinical diagnosis and prognosis. However, effective feature selection methods that capture essential data behaviors well enough to achieve clinical-level disease diagnosis are still lacking. The field also faces a data reproducibility challenge: no two independent studies have been found to produce the same proteomic patterns. As a result, the identified biomarker patterns lose repeatability, which prevents their use in real clinical settings.

Methods: In this work, we propose a novel machine-learning algorithm, derivative component analysis (DCA), for high-dimensional mass spectral proteomic profiles. As an implicit feature selection algorithm, DCA examines input proteomics data in a multi-resolution manner, seeking its derivatives to capture latent data characteristics and to de-noise the data. We further demonstrate DCA's advantages in disease diagnosis by treating the input proteomics data as a profile biomarker and integrating DCA with support vector machines to tackle the reproducibility issue. We also compare it with state-of-the-art peers.

Results: Our results show that high-dimensional proteomics data are actually linearly separable under the proposed DCA. As a novel multi-resolution feature selection algorithm, DCA not only overcomes the weakness of traditional methods in discovering subtle data behaviors, but also suggests an effective way to overcome the reproducibility problem of proteomics data, and it provides new techniques and insights for translational bioinformatics and machine learning. The DCA-based profile biomarker diagnosis makes clinical-level diagnostic performance reproducible across different proteomic data, and it is more robust and systematic than existing diagnosis based on biomarker discovery.

Conclusions: Our findings demonstrate the feasibility and power of the proposed DCA-based profile biomarker diagnosis in achieving high sensitivity and overcoming the data reproducibility issue in serum proteomics. Furthermore, DCA suggests that gleaning subtle data characteristics and de-noising are essential for separating true signals from red herrings in high-dimensional proteomic profiles, and that they can be more important than conventional feature selection or dimension reduction. In particular, because DCA is a generic data analysis method, our profile biomarker diagnosis can be generalized to other omics data.
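
The abstract does not spell out how the derivatives are computed, so the following is only a rough illustration of the general idea of examining a spectrum's derivatives at several resolutions. It is not the DCA algorithm itself: it assumes box-filter smoothing at a few hypothetical window widths and finite-difference derivatives as stand-ins, and the names `moving_average` and `multiscale_derivative_features` are invented for this sketch.

```python
import numpy as np


def moving_average(signal, width):
    """Smooth a 1-D signal with a centered box filter of odd `width`."""
    kernel = np.ones(width) / width
    # Pad with edge values so the output keeps the input length.
    padded = np.pad(signal, width // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


def multiscale_derivative_features(spectra, widths=(1, 5, 15)):
    """Stack first derivatives of each spectrum taken at several scales.

    `spectra` is a (samples x m/z bins) array. The window widths are an
    assumption of this sketch, not values taken from the paper.
    """
    spectra = np.asarray(spectra, dtype=float)
    blocks = []
    for width in widths:
        # Coarser widths suppress more high-frequency noise before differencing.
        smoothed = np.apply_along_axis(moving_average, 1, spectra, width)
        blocks.append(np.gradient(smoothed, axis=1))
    return np.hstack(blocks)
```

Each row of the returned matrix could then be passed to a linear classifier such as a support vector machine. One property this sketch does illustrate is that derivative features ignore a constant baseline offset, so two spectra that differ only by such an offset map to the same feature vector.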
