A novel approach for clustering proteomics data using Bayesian fast Fourier transform

MOTIVATION Bioinformatics clustering tools are useful at all levels of proteomic data analysis. Proteomics studies can provide a wealth of information and rapidly generate large quantities of data from the analysis of biological specimens. The high dimensionality of these data requires improved bioinformatics tools for efficient and accurate analysis. Proteome profiling of a particular system or organism calls for a number of specialized software tools, and significant advances are still needed in the informatics and software that support the analysis and management of such massive amounts of data. Clustering algorithms based on probabilistic and Bayesian models provide an alternative to heuristic algorithms. In these models, choosing the number of clusters (for example, diseased and non-diseased groups) reduces to choosing the number of components in a mixture of underlying probability distributions. The Bayesian approach is a tool for incorporating information from the data into the analysis, and it provides estimates of the uncertainty in the data and in the parameters involved.

RESULTS We present novel algorithms that can organize, cluster and derive meaningful patterns of expression from large-scale proteomics experiments. We processed the raw data with a graphical-based algorithm, transforming it from a real-space representation to a complex-space representation using the discrete Fourier transform; we then used a thresholding approach to denoise each spectrum and reduce its length. Bayesian clustering was applied to the reconstructed data. Compared with the other algorithms used in this study, including K-means, the Kohonen self-organizing map (SOM), and linear discriminant analysis, the Bayesian-Fourier model-based approach was consistently better at selecting the correct model and the number of clusters, thus providing a novel approach for accurate diagnosis of the disease.
Using this approach, we successfully denoised proteomic spectra and reduced the number of peaks by up to 99% relative to the original data. In addition, the Bayesian-based approach achieved a better classification rate than the other classification algorithms. This finding will allow us to apply the Fourier transform to select the protein profile for each sample, and to develop a novel bioinformatics strategy based on Bayesian clustering for biomarker discovery and optimal diagnosis.
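The Fourier denoising step described above can be illustrated with a minimal sketch. It assumes NumPy and a synthetic signal rather than real proteomic spectra, and it is not the authors' implementation: the abstract does not state the thresholding rule, so the function name `fft_threshold_denoise` and the `keep_fraction` parameter are hypothetical choices that simply keep the largest-magnitude Fourier coefficients. The Bayesian clustering step is not shown.

```python
import numpy as np


def fft_threshold_denoise(spectrum, keep_fraction=0.05):
    """Keep the largest-magnitude Fourier coefficients, then reconstruct.

    keep_fraction is a hypothetical tuning parameter (an assumption);
    the source does not specify how the threshold is chosen.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    # Real-valued input: rfft returns the non-redundant half of the DFT.
    coeffs = np.fft.rfft(spectrum)
    n_keep = max(1, int(np.ceil(keep_fraction * coeffs.size)))
    # Magnitude of the n_keep-th largest coefficient acts as the threshold.
    cutoff = np.sort(np.abs(coeffs))[-n_keep]
    mask = np.abs(coeffs) >= cutoff
    # Zero the small coefficients and transform back to real space.
    reconstructed = np.fft.irfft(coeffs * mask, n=spectrum.size)
    return reconstructed, int(mask.sum())


# Synthetic stand-in for a spectrum: two smooth components plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 512, endpoint=False)
clean = np.sin(2 * np.pi * 3 * x) + 0.5 * np.sin(2 * np.pi * 7 * x)
noisy = clean + 0.3 * rng.standard_normal(x.size)

denoised, kept = fft_threshold_denoise(noisy, keep_fraction=0.02)
mse_noisy = float(np.mean((noisy - clean) ** 2))
mse_denoised = float(np.mean((denoised - clean) ** 2))
print(f"kept {kept} coefficients; MSE {mse_noisy:.4f} -> {mse_denoised:.4f}")
```

In this sketch the retained coefficients form the compact representation of each spectrum, and the reconstructed signal is what a downstream clustering method would receive.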
