Statistical Methods for Proteomic Biomarker Discovery Based on Feature Extraction or Functional Modeling Approaches

In recent years, developments in molecular biotechnology have increased the promise of detecting and validating biomarkers, that is, molecular markers that relate to various biological or medical outcomes. Proteomics, the direct study of proteins in biological samples, plays an important role in the biomarker discovery process. Proteomic technologies produce complex, high-dimensional functional and image data, and these data pose analytical challenges that must be addressed properly if comparative proteomics studies are to yield potential biomarkers. Specific challenges include experimental design, preprocessing, feature extraction, and statistical analysis that accounts for the inherent multiple testing issues. This paper reviews various computational aspects of comparative proteomic studies and summarizes contributions that I, along with numerous collaborators, have made. First, there is an overview of comparative proteomics technologies, followed by a discussion of important experimental design and preprocessing issues that must be considered before statistical analysis can be done. Next, the two key approaches to analyzing proteomics data, feature extraction and functional modeling, are described. Feature extraction involves detecting and quantifying discrete features, such as peaks or spots, that theoretically correspond to different proteins in the sample. After an overview of the feature extraction approach, specific methods for mass spectrometry (Cromwell) and 2D gel electrophoresis (Pinnacle) are described. The functional modeling approach involves modeling the proteomic data in their entirety as functions or images. A general discussion of the approach is followed by the presentation of a specific method, wavelet-based functional mixed models, and its extensions. All methods are illustrated by application to two example proteomic data sets, one from mass spectrometry and one from 2D gel electrophoresis. While the methods presented are applied to two specific proteomic technologies, MALDI-TOF and 2D gel electrophoresis, these methods and the other principles discussed in the paper apply much more broadly to other expression proteomics technologies.
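
As a rough illustration of the feature extraction idea described above (detect discrete peaks, then quantify each peak in every spectrum), the following Python sketch may help. It is not the Cromwell or Pinnacle algorithm. The simulated spectra, the choice to detect peaks on the pointwise average of aligned spectra, the height threshold, the window width, and all function names are assumptions made only for this example.

```python
import math
import random


def simulate_spectra(n_spectra=20, n_points=200, seed=0):
    """Simulate aligned toy spectra: two Gaussian-shaped peaks plus white noise."""
    rng = random.Random(seed)
    true_peaks = [(60, 10.0), (140, 6.0)]  # hypothetical (location, height) pairs
    spectra = []
    for _ in range(n_spectra):
        s = []
        for x in range(n_points):
            signal = sum(h * math.exp(-((x - c) ** 2) / 18.0) for c, h in true_peaks)
            s.append(signal + rng.gauss(0.0, 0.2))
        spectra.append(s)
    return spectra


def mean_spectrum(spectra):
    """Average the aligned spectra point by point."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


def detect_peaks(signal, min_height):
    """Return indices of local maxima whose value exceeds min_height."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if signal[i] > min_height and signal[i - 1] < signal[i] >= signal[i + 1]:
            peaks.append(i)
    return peaks


def quantify(spectra, peaks, half_width=2):
    """For each spectrum, record the maximum intensity in a window around each peak."""
    table = []
    for s in spectra:
        row = [max(s[max(0, p - half_width): p + half_width + 1]) for p in peaks]
        table.append(row)
    return table


spectra = simulate_spectra()
peaks = detect_peaks(mean_spectrum(spectra), min_height=2.0)  # threshold is an assumption
table = quantify(spectra, peaks)  # rows: spectra, columns: detected peaks
print("peak locations:", peaks)
print("first spectrum intensities:", [round(v, 2) for v in table[0]])
```

The resulting table has one row per spectrum and one column per detected peak, which is the form on which per-feature statistical tests, with an adjustment for multiple testing, would then be run.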

[1]  Jeffrey S. Morris,et al.  Evaluating the performance of new approaches to spot quantification and differential expression in 2-dimensional gel electrophoresis studies. , 2010, Journal of proteome research.

[2]  Jeffrey S. Morris,et al.  Pinnacle: a fast, automatic and accurate method for detecting and quantifying protein spots in 2-dimensional gel electrophoresis data , 2008, Bioinform..

[3]  R. Aebersold,et al.  Mass spectrometry-based proteomics , 2003, Nature.

[4]  M. Mann,et al.  Electrospray Ionization for Mass Spectrometry of Large Biomolecules , 1990 .

[5]  Emanuel F. Petricoin,et al.  High-resolution serum proteomic features for ovarian cancer detection. , 2004 .

[6]  R. O’Hara,et al.  A review of Bayesian variable selection methods: what, how and which , 2009 .

[7]  J. Marron,et al.  PCA CONSISTENCY IN HIGH DIMENSION, LOW SAMPLE SIZE CONTEXT , 2009, 0911.3827.

[8]  J. S. Marron,et al.  Geometric representation of high dimension, low sample size data , 2005 .

[9]  Werner Dubitzky,et al.  Fundamentals of Data Mining in Genomics and Proteomics , 2009 .

[10]  Susmita Datta,et al.  Empirical Bayes screening of many p-values with applications to microarray studies , 2005, Bioinform..

[11]  James G. Scott,et al.  The horseshoe estimator for sparse signals , 2010 .

[12]  J. S. Rao,et al.  Detecting Differentially Expressed Genes in Microarrays Using Bayesian Model Selection , 2003 .

[13]  Jeffrey S. Morris,et al.  Robust Classification of Functional and Quantitative Image Data Using Functional Mixed Models , 2012, Biometrics.

[14]  H. Bondell,et al.  Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR , 2008, Biometrics.

[15]  J. Marron,et al.  The high-dimension, low-sample-size geometric representation holds under mild conditions , 2007 .

[16]  Korbinian Strimmer,et al.  A unified approach to false discovery rate estimation , 2008, BMC Bioinformatics.

[17]  M. Newton Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis , 2008 .

[18]  Jeffrey S. Morris,et al.  Bayesian Analysis of Mass Spectrometry Proteomic Data Using Wavelet‐Based Functional Mixed Models , 2008, Biometrics.

[19]  Marina Vannucci,et al.  Wavelet-Based Nonparametric Modeling of Hierarchical Functions in Colon Carcinogenesis , 2003 .

[20]  L. Wasserman,et al.  Operating characteristics and extensions of the false discovery rate procedure , 2002 .

[21]  Guang-Zhong Yang,et al.  Gene expression Automated image alignment for 2 D gel electrophoresis in a high-throughput proteomics pipeline , 2008 .

[22]  Jeffrey S. Morris,et al.  PrepMS: TOF MS data graphical preprocessing tool , 2007, Bioinform..

[23]  D. Paul ASYMPTOTICS OF SAMPLE EIGENSTRUCTURE FOR A LARGE DIMENSIONAL SPIKED COVARIANCE MODEL , 2007 .

[24]  Nouna Kettaneh,et al.  Statistical Modeling by Wavelets , 1999, Technometrics.

[25]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[26]  Deepayan Sarkar,et al.  Detecting differential gene expression with a semiparametric hierarchical mixture method. , 2004, Biostatistics.

[27]  Jeffrey S. Morris,et al.  Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum , 2005, Bioinform..

[28]  Jeffrey S. Morris,et al.  Bias, Randomization, and Ovarian Proteomic Data: A Reply to “Producers and Consumers” , 2005, Cancer informatics.

[29]  L H Parsons,et al.  Serotonin dysfunction in the nucleus accumbens of rats during withdrawal after unlimited access to intravenous cocaine. , 1995, The Journal of pharmacology and experimental therapeutics.

[30]  John D. Storey The positive false discovery rate: a Bayesian interpretation and the q-value , 2003 .

[31]  Jeffrey S. Morris,et al.  Statistical contributions to proteomic research. , 2010, Methods in molecular biology.

[32]  Lukas N. Mueller,et al.  An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. , 2008, Journal of proteome research.

[33]  Guang-Zhong Yang,et al.  Informatics and Statistics for Analyzing 2-D Gel Electrophoresis Images , 2010, Proteome Bioinformatics.

[34]  I. Johnstone On the distribution of the largest eigenvalue in principal components analysis , 2001 .

[35]  Jie Chen,et al.  A Bayesian determination of threshold for identifying differentially expressed genes in microarray experiments , 2006, Statistics in medicine.

[36]  Kevin R Coombes,et al.  Plasma protein profiling for diagnosis of pancreatic cancer reveals the presence of host response proteins. , 2005, Clinical cancer research : an official journal of the American Association for Cancer Research.

[37]  E. Petricoin,et al.  High-resolution serum proteomic features for ovarian cancer detection. , 2004, Endocrine-related cancer.

[38]  M. Karas,et al.  Matrix-assisted ultraviolet laser desorption of non-volatile compounds , 1987 .

[39]  Stan Pounds,et al.  Estimating the Occurrence of False Positives and False Negatives in Microarray Studies by Approximating and Partitioning the Empirical Distribution of P-values , 2003, Bioinform..

[40]  R. Twyman Principles of Proteomics , 2013 .

[41]  B. Silverman,et al.  Functional Data Analysis , 1997 .

[42]  B. Efron Large-Scale Simultaneous Hypothesis Testing , 2004 .

[43]  J. Griffin,et al.  Alternative prior distributions for variable selection with very many more variables than observations , 2005 .

[44]  Ana-Maria Staicu,et al.  Generalized Multilevel Functional Regression , 2009, Journal of the American Statistical Association.

[45]  Sarah R. Edmonson,et al.  High-resolution serum proteomic patterns for ovarian cancer detection. , 2004, Endocrine-related cancer.

[46]  John D. Storey A direct approach to false discovery rates , 2002 .

[47]  C. Sitthi-amorn,et al.  Bias , 1993, The Lancet.

[48]  Jeffrey T Leek,et al.  A general framework for multiple testing dependence , 2008, Proceedings of the National Academy of Sciences.

[49]  Jeffrey S. Morris,et al.  Improved peak detection and quantification of mass spectrometry data acquired from surface‐enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform , 2005, Proteomics.

[50]  Henry H. N. Lam,et al.  Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics. , 2008, Physiological genomics.

[51]  Jeffrey S. Morris,et al.  Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments , 2004, Bioinform..

[52]  J. W. Silverstein,et al.  Eigenvalues of large sample covariance matrices of spiked population models , 2004, math/0408165.

[53]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[54]  Kathryn S Lilley,et al.  Maximising sensitivity for detecting changes in protein expression: Experimental design using minimal CyDyes , 2005, Proteomics.

[55]  Sujit K. Ghosh,et al.  Essential Wavelets for Statistical Applications and Data Analysis , 2001, Technometrics.

[56]  Hongxiao Zhu,et al.  Robust, Adaptive Functional Regression in Functional Mixed Model Framework , 2011, Journal of the American Statistical Association.

[57]  G F Koob,et al.  Transition from moderate to excessive drug intake: change in hedonic set point. , 1998, Science.

[58]  L. Liotta,et al.  High-resolution serum proteomic patterns for ovarian cancer detection , 2004 .

[59]  Y. Benjamini,et al.  A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence , 1999 .

[60]  H. Zou The Adaptive Lasso and Its Oracle Properties , 2006 .

[61]  S. Gygi,et al.  Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[62]  Jeffrey S. Morris,et al.  Wavelet‐based functional mixed models , 2006, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[63]  Y. Benjamini,et al.  Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics , 1999 .

[64]  Jeffrey S. Morris,et al.  AUTOMATED ANALYSIS OF QUANTITATIVE IMAGE DATA USING ISOMORPHIC FUNCTIONAL MIXED MODELS, WITH APPLICATION TO PROTEOMICS DATA. , 2011, The annals of applied statistics.

[65]  N. Sherman,et al.  Protein Sequencing and Identification Using Tandem Mass Spectrometry: Kinter/Tandem Mass Spectrometry , 2000 .

[66]  Jeffrey S. Morris,et al.  Wavelet-based functional mixed model analysis: Computational considerations , 2006 .

[67]  S. Péché,et al.  Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices , 2004, math/0403022.

[68]  Jianqing Fan,et al.  Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties , 2001 .

[69]  Brittan N Clark,et al.  The myth of automated, high‐throughput two‐dimensional gel analysis , 2008, Proteomics.

[70]  Guang-Zhong Yang,et al.  Image analysis tools and emerging algorithms for expression proteomics , 2010, Proteomics.

[71]  P. O’Farrell High resolution two-dimensional electrophoresis of proteins. , 1975, The Journal of biological chemistry.

[72]  Stéphane Mallat,et al.  A Theory for Multiresolution Signal Decomposition: The Wavelet Representation , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[73]  Cheng Cheng,et al.  Improving false discovery rate estimation , 2004, Bioinform..

[74]  Jeffrey S. Morris,et al.  The importance of experimental design in proteomic mass spectrometry experiments: some cautionary tales. , 2005, Briefings in functional genomics & proteomics.

[75]  Guang-Zhong Yang,et al.  Image Analysis Tools in Proteomics , 2011 .