A CBR framework with gradient boosting based feature selection for lung cancer subtype classification

Molecular subtype classification is a challenging problem in lung cancer diagnosis. Although different methods have been proposed for biomarker selection, efficient discrimination between adenocarcinoma and squamous cell carcinoma remains difficult in clinical practice, especially when the latter is poorly differentiated. This is an area of growing importance, since certain treatments and other medical decisions are based on molecular and histological features. There is an urgent need for a system and a set of biomarkers that provide an accurate diagnosis. In this paper, a novel Case-Based Reasoning (CBR) framework with gradient-boosting-based feature selection is proposed and applied to the task of discriminating squamous cell carcinoma from adenocarcinoma, aiming to provide an accurate diagnosis with a reduced set of genes. The proposed method was trained and evaluated on two independent datasets to validate its generalization capability. It achieved higher accuracy than traditional microarray analysis techniques while retaining the advantages inherent to the Case-Based Reasoning methodology, such as learning over time, adaptability, and interpretability of solutions.
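
The abstract does not spell out the algorithmic details, so the following is only a minimal sketch of how the two ingredients could fit together, not the authors' implementation. It assumes squared-loss gradient boosting with single-gene decision stumps as the feature selector and distance-weighted k-nearest-neighbour voting as the CBR retrieve/reuse step. All function names, parameters, and the synthetic expression profiles (six hypothetical genes, of which only gene index 2 is informative) are illustrative assumptions.

```python
import math
import random

def fit_stump(X, residuals):
    """Find the one-gene threshold split that minimises squared error on the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best

def boosted_gene_importance(X, y, rounds=20, lr=0.3):
    """Boost stumps on squared loss; credit each gene with the error reduction of its splits."""
    pred = [sum(y) / len(y)] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient of squared loss
        sse, j, t, lm, rm = fit_stump(X, residuals)
        importance[j] += sum(r * r for r in residuals) - sse
        pred = [p + lr * (lm if row[j] <= t else rm) for row, p in zip(X, pred)]
    return importance

def classify(case_base, query, genes, k=3):
    """CBR retrieve and reuse: distance-weighted vote of the k most similar stored cases."""
    scored = sorted(
        (math.dist([profile[g] for g in genes], [query[g] for g in genes]), label)
        for profile, label in case_base
    )
    votes = {}
    for dist, label in scored[:k]:
        votes[label] = votes.get(label, 0.0) + 1.0 / (dist + 1e-9)
    return max(votes, key=votes.get)

# Synthetic case base (assumption): gene 2 is shifted upwards in SCC, the rest is noise.
rng = random.Random(0)
case_base = []
for i in range(40):
    label = ["ADC", "SCC"][i % 2]
    profile = [rng.uniform(0.0, 1.0) for _ in range(6)]
    if label == "SCC":
        profile[2] += 3.0
    case_base.append((profile, label))

X = [profile for profile, _ in case_base]
y = [1.0 if label == "SCC" else 0.0 for _, label in case_base]
importance = boosted_gene_importance(X, y)
selected = [j for j, gain in enumerate(importance) if gain > 0]  # reduced gene set

query = [0.5, 0.5, 3.5, 0.5, 0.5, 0.5]
prediction = classify(case_base, query, selected)
# CBR retain: once the diagnosis is confirmed, the solved case joins the case base.
case_base.append((query, prediction))
print(selected, prediction)
```

Because a prediction is a vote among retrieved cases, the stored cases that drove it can be shown to the clinician, and appending confirmed cases is how such a system learns over time without retraining the classifier. In a real setting the gene set would presumably be re-selected periodically as the case base grows.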
