Guilt-by-association feature selection: Identifying biomarkers from proteomic profiles

In recent years, proteomic profiling by mass spectrometry has opened up new ways of identifying potential biomarkers. Mass spectrometry data, like other proteomic and genomic data, are challenging to analyze because they are high-dimensional and few samples are available. Feature selection is therefore extremely important: by choosing a subset of effective features that separate diseased samples from healthy ones, it directly provides a list of potential biomarkers. The rule of thumb for feature selection is that features must be both discriminant and independent to give the best separation of the two groups. In general, however, existing feature selection algorithms take into account only the discrimination ability of features. In this paper, we present a novel method for feature selection, guilt-by-association feature selection (GBA-FS), which makes it possible to select features that are independent as well as discriminant. After measuring similarities between features, the algorithm groups similar features together using a clustering algorithm and selects the best representative feature from each group. As a result, it produces a list of discriminant and independent features. The efficacy of GBA-FS was extensively tested on two real-world SELDI-TOF data sets. The experimental results demonstrate that GBA-FS selects more independent features than a common filter-type feature selection method, the t-test. The results also show that GBA-FS can be used to deconvolve multiply charged states of the same protein molecules. Because GBA-FS successfully identifies feature groups with similar mass values, it can also be employed as an alternative to peak detection for preprocessing mass spectrometry data.
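
The three steps named in the abstract (measure feature similarity, cluster similar features, keep the best representative of each cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state which similarity measure, clustering algorithm, or ranking score GBA-FS uses, so the sketch assumes absolute Pearson correlation as the similarity, single-linkage grouping at a fixed threshold as the clustering, and the absolute Welch t statistic as the discriminance score. The names `gba_fs_sketch`, `pearson`, `welch_t`, and `threshold` are hypothetical.

```python
import math
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def welch_t(a, b):
    """Absolute Welch t statistic between two groups (each needs >= 2 values)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def gba_fs_sketch(X, y, threshold=0.9):
    """X: list of samples, each a list of feature intensities; y: 0/1 labels.

    Returns one feature index per cluster, most discriminant first.
    """
    n_features = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_features)]

    # Steps 1-2 (assumed choices): features whose |r| >= threshold are linked,
    # and linked features are merged into one group with union-find, which is
    # equivalent to cutting a single-linkage clustering at that threshold.
    parent = list(range(n_features))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n_features):
        for j in range(i + 1, n_features):
            if abs(pearson(cols[i], cols[j])) >= threshold:
                parent[find(i)] = find(j)

    # Step 3 (assumed score): rank every feature by its Welch t statistic
    # and keep only the highest-scoring feature of each group.
    scores = []
    for col in cols:
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        scores.append(welch_t(pos, neg))

    best = {}
    for j in range(n_features):
        root = find(j)
        if root not in best or scores[j] > scores[best[root]]:
            best[root] = j
    return sorted(best.values(), key=lambda j: -scores[j])
```

Called as `gba_fs_sketch(X, y)`, the function returns at most one feature per correlated group, so two peaks that rise and fall together (for example, differently charged states of one protein) contribute a single candidate rather than two. A plain t-test ranking, by contrast, would list both. The pairwise loop is quadratic in the number of features, which is acceptable for a sketch but would need a faster similarity computation on full spectra.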
