Examining the significance of fingerprint-based classifiers

Background
Experimental examinations of biofluids that measure concentrations of proteins, their fragments, or metabolites are being explored as a means of detecting disease early, distinguishing diseases with similar symptoms, and assessing drug treatment efficacy. Many studies have produced classifiers with high sensitivity and specificity, and it has been argued that accurate results necessarily imply some underlying biology-based features in the classifier. The simplest test of this conjecture is to apply classifiers used in many published studies to datasets designed to contain no information.

Results
The classification accuracy of two fingerprint-based classifiers, a decision tree (DT) algorithm and a medoid classification algorithm (MCA), is examined. These methods are applied to 30 artificial datasets that contain random concentration levels for 300 biomolecules. Each dataset contains between 30 and 300 Cases and Controls, and because the 300 observed concentrations are randomly generated, these datasets are constructed to contain no biological information. A modest search of decision trees containing at most seven decision nodes finds a large number of unique decision trees with an average sensitivity and specificity above 85% for datasets containing 60 Cases and 60 Controls or fewer, and for datasets with 90 Cases and 90 Controls many DTs have an average sensitivity and specificity above 80%. Even for the largest dataset (300 Cases and 300 Controls), the MCA procedure finds several unique classifiers that have an average sensitivity and specificity above 88% using only six or seven features.

Conclusion
While it has been argued that accurate classification results must imply some biological basis for the separation of Cases from Controls, our results show that this is not necessarily true. The DT and MCA classifiers are sufficiently flexible to produce good results from datasets that are specifically constructed to contain no information. This means that a chance fitting to the data is possible. All datasets used in this investigation are available on the web.

This work is funded by NCI Contract N01-CO-12400.
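
The chance-fitting effect described in the Results can be illustrated with a small sketch. It is not the DT search or the MCA procedure used in this study: it swaps in a plain greedy decision tree, and the function names, the uniform random "concentrations", the seed, and the dataset size (30 Cases, 30 Controls, 300 features) are assumptions chosen only for illustration. A depth of 3 limits the tree to at most seven decision nodes. The tree is scored on the random data it was fitted to and then on a second, independent random dataset.

```python
import random

def make_random_dataset(n_per_class, n_features, rng):
    """Uniform random 'concentrations'; labels are independent of the features."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(2 * n_per_class)]
    y = [1] * n_per_class + [0] * n_per_class  # 1 = Case, 0 = Control
    return X, y

def best_split(X, y, rows):
    """Return (errors, feature, threshold) minimizing training errors, or None."""
    best = None
    total_pos = sum(y[r] for r in rows)
    for f in range(len(X[0])):
        order = sorted(rows, key=lambda r: X[r][f])
        left_pos = 0
        for i in range(1, len(order)):
            left_pos += y[order[i - 1]]
            if X[order[i - 1]][f] == X[order[i]][f]:
                continue  # cannot separate identical values
            right_pos = total_pos - left_pos
            right_n = len(order) - i
            # Each side predicts its majority class; count the minority as errors.
            errors = min(left_pos, i - left_pos) + min(right_pos, right_n - right_pos)
            if best is None or errors < best[0]:
                best = (errors, f, X[order[i - 1]][f])
    return best

def build_tree(X, y, rows, depth):
    """Greedy tree: a leaf is 0 or 1, a node is (feature, threshold, left, right)."""
    pos = sum(y[r] for r in rows)
    majority = 1 if 2 * pos >= len(rows) else 0
    if depth == 0 or pos == 0 or pos == len(rows):
        return majority
    split = best_split(X, y, rows)
    if split is None:
        return majority
    _, f, thr = split
    left = [r for r in rows if X[r][f] <= thr]
    right = [r for r in rows if X[r][f] > thr]
    return (f, thr, build_tree(X, y, left, depth - 1), build_tree(X, y, right, depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def sensitivity_specificity(tree, X, y):
    tp = sum(1 for x, t in zip(X, y) if t == 1 and predict(tree, x) == 1)
    tn = sum(1 for x, t in zip(X, y) if t == 0 and predict(tree, x) == 0)
    n_pos = sum(y)
    return tp / n_pos, tn / (len(y) - n_pos)

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
X, y = make_random_dataset(n_per_class=30, n_features=300, rng=rng)
tree = build_tree(X, y, list(range(len(y))), depth=3)  # at most 7 decision nodes
sens, spec = sensitivity_specificity(tree, X, y)
print(f"fitted data:      sensitivity={sens:.2f} specificity={spec:.2f}")

# A second random dataset shows whether the apparent accuracy carries over.
X_new, y_new = make_random_dataset(n_per_class=30, n_features=300, rng=rng)
sens_new, spec_new = sensitivity_specificity(tree, X_new, y_new)
print(f"independent data: sensitivity={sens_new:.2f} specificity={spec_new:.2f}")
```

Because every feature is noise, any accuracy well above 50% on the fitted data comes from the search over 300 features and many thresholds, and the independent dataset should score near 50%. Exact figures depend on the seed and on the search strategy, so they are not comparable to the percentages reported above.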
