LATENT PROTEIN TREES

Unbiased, label-free proteomics is becoming a powerful technique for measuring protein expression in almost any biological sample. After preprocessing, the output of these measurements is a collection of features and their associated intensities for each sample. Subsets of features within the data are from the same peptide, subsets of peptides are from the same protein, and subsets of proteins are in the same biological pathways. These data therefore have the potential for very complex and informative correlation structure. Recent attempts to use these data often focus on identifying single features that are associated with a particular phenotype relevant to the experiment. To date, however, no published approach directly models what we know to be multiple different levels of correlation structure. Here we present a hierarchical Bayesian model that is specifically designed to model such correlation structure in unbiased, label-free proteomics. The model uses partial identification information from peptide sequencing and database lookup, together with the observed correlation in the data, to compress features into latent proteins and to estimate their correlation structure. We demonstrate the effectiveness of the model using artificial/benchmark data and in the context of a series of proteomics measurements of blood plasma from a collection of volunteers who were infected with two different strains of viral influenza.
