Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality

Background: In the era of precision oncology and publicly available datasets, the amount of information available for each patient case has dramatically increased. From clinical variables and PET-CT radiomics measures to DNA-variant and RNA expression profiles, such a wide variety of data presents a multitude of challenges. Large clinical datasets are subject to sparsely and/or inconsistently populated fields. Corresponding sequencing profiles can suffer from the problem of high dimensionality, where making useful inferences can be difficult without correspondingly large numbers of instances. In this paper we report a novel deployment of machine learning techniques to handle data sparsity and high dimensionality, while evaluating potential biomarkers in the form of unsupervised transformations of RNA data. We apply preprocessing, multivariate imputation by chained equations (MICE), and sparse principal component analysis (SPCA) to improve the usability of more than 500 patient cases from the TCGA-HNSC dataset, with the aim of enhancing future oncological decision support for head and neck squamous cell carcinoma (HNSCC).

Results: Imputation was shown to improve the prognostic ability of sparse clinical treatment variables. SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant. Gene ontology enrichment analysis of the gene sets associated with individual sparse principal components (SPCs) is also reported. Both high- and low-importance SPCs were associated with cell death pathways, though the high-importance gene sets were associated with a wider variety of cancer-related biological processes.

Conclusions: MICE imputation allowed us to impute missing values for clinically informative features, improving their overall importance for predicting two-year recurrence-free survival by incorporating variance from other clinical variables. Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly. SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.
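The two preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes Python with scikit-learn, uses `IterativeImputer` as a stand-in for chained-equations (MICE-style) imputation and `SparsePCA` for the SPCA step, and runs on small synthetic arrays. The array sizes, the `alpha` value, and all variable names are hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)

# Hypothetical stand-in data: NOT the TCGA-HNSC dataset.
n_patients, n_clinical, n_genes = 60, 5, 40
clinical = rng.normal(size=(n_patients, n_clinical))
clinical[:, 1] += 0.8 * clinical[:, 0]  # correlate two columns so imputation has signal

# Knock out roughly 20% of the clinical values to mimic sparsely populated fields.
mask = rng.random(clinical.shape) < 0.2
clinical_missing = clinical.copy()
clinical_missing[mask] = np.nan

# Chained-equations style imputation: each incomplete column is modelled
# from the other columns, cycling for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
clinical_imputed = imputer.fit_transform(clinical_missing)

# Sparse PCA on a synthetic "expression" matrix: each component is a sparse
# combination of genes, so it maps back to a compact gene set.
rna = rng.normal(size=(n_patients, n_genes))
spca = SparsePCA(n_components=4, alpha=1.0, random_state=0)
rna_spcs = spca.fit_transform(rna)

# Indices of genes with non-zero loadings for each sparse component; sets like
# these are what one would pass to a gene ontology enrichment tool.
gene_sets = [np.flatnonzero(component) for component in spca.components_]
```

In a sketch like this, `clinical_imputed` and `rna_spcs` would be the inputs to a downstream classifier, and `gene_sets` links each sparse component back to specific genes. Setting `sample_posterior=True` draws imputed values rather than using point predictions, which is closer in spirit to multiple imputation, although a full MICE analysis would generate and pool several imputed datasets.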
