Recursive SVM biomarker selection for early detection of breast cancer in peripheral blood

Background

Worldwide, breast cancer is the second most common type of cancer after lung cancer. Traditional mammography and tissue microarrays have been studied for early cancer detection and cancer prediction. However, there is a need for more reliable diagnostic tools for early detection of breast cancer. Early detection is challenging for several reasons. First, obtaining tissue biopsies can be difficult. Second, mammography may not detect small tumors and is often unsatisfactory for younger women, who typically have dense breast tissue. Lastly, breast cancer is not a single homogeneous disease but consists of multiple disease states, each arising from a distinct molecular mechanism and following a distinct clinical progression path, which makes the disease difficult to detect and predict in its early stages.

Results

In this paper, we present a Support Vector Machine algorithm based on Recursive Feature Elimination and Cross Validation (SVM-RFE-CV) for early detection of breast cancer in peripheral blood, and we show how to use SVM-RFE-CV to model the classification and prediction problem of early detection of breast cancer in peripheral blood. The training set, consisting of 32 healthy and 33 cancer samples, and the testing set, consisting of 31 healthy and 34 cancer samples, were randomly separated from a peripheral blood breast cancer dataset downloaded from the Gene Expression Omnibus. First, we identified 42 biomarkers differentially expressed between "normal" and "cancer". Then, with SVM-RFE-CV, we extracted 15 biomarkers that yield a cross-validation score of zero. Lastly, we compared the classification and prediction performance of SVM-RFE-CV with that of SVM and SVM Recursive Feature Elimination (SVM-RFE).

Conclusions

We found that 1) SVM-RFE-CV is suitable for analyzing noisy high-throughput microarray data, 2) it outperforms SVM-RFE in robustness to noise and in the ability to recover informative features, and 3) it can improve the prediction performance (Area Under Curve) on the testing data set from 0.5826 to 0.7879. Further pathway analysis showed that the biomarkers are associated with Signaling, Hemostasis, Hormones, and Immune System, which is consistent with previous findings. Our prediction model can serve as a general model for biomarker discovery in early detection of other cancers. In the future, Polymerase Chain Reaction (PCR) is planned to validate the ability of these potential biomarkers to detect breast cancer early.
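
The general workflow described above (recursive feature elimination around a linear SVM, with cross validation choosing how many features to keep, followed by an AUC comparison on a held-out set) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn's `RFECV` as a stand-in for SVM-RFE-CV, uses synthetic data from `make_classification` in place of the Gene Expression Omnibus dataset, and picks the fold count, `C`, and random seeds arbitrarily. Only the shape of the data (130 samples, 42 candidate features, an even train/test split) mirrors the abstract, so the printed numbers are not comparable to the reported results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-in data, not the peripheral blood dataset.
# 130 samples and 42 candidate features mirror the sizes in the abstract.
X, y = make_classification(
    n_samples=130, n_features=42, n_informative=10, n_redundant=5,
    random_state=0,
)

# Random, stratified 65/65 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# A linear kernel exposes per-feature weights (coef_), which recursive
# feature elimination uses to rank and drop the weakest feature each round.
svm = SVC(kernel="linear", C=1.0)

# Cross validation on the training set selects the number of features.
# ASSUMPTION: 5 folds and accuracy scoring are illustrative choices.
selector = RFECV(
    estimator=svm,
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X_train, y_train)

# Column indices of the features that survived elimination.
selected = np.flatnonzero(selector.support_)

# AUC on the held-out set using only the selected features.
scores_selected = selector.estimator_.decision_function(
    selector.transform(X_test)
)
auc_selected = roc_auc_score(y_test, scores_selected)

# Baseline: the same linear SVM trained on all 42 features.
baseline = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
auc_all = roc_auc_score(y_test, baseline.decision_function(X_test))

print(f"features kept: {selector.n_features_} of {X.shape[1]}")
print(f"test AUC, all features:      {auc_all:.4f}")
print(f"test AUC, selected features: {auc_selected:.4f}")
```

Feature selection here is fitted on the training split only, so the testing split stays untouched until the final AUC is computed; selecting features on the full dataset would leak information into the evaluation.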
