Benchmarking the impact of data transformation, pre-processing and choice of method in the computational deconvolution of transcriptomics data

Many computational methods have been developed to infer the proportions of individual cell types from bulk transcriptomics data, a task known as computational deconvolution. Previous comparisons of these methods revealed that the choice of reference signatures is far more important than the method itself. However, a thorough evaluation of the combined impact of data transformation, pre-processing and methodology on the results is still lacking. Using single-cell RNA-sequencing (scRNA-seq) data from human pancreas and peripheral blood mononuclear cells (PBMCs), we artificially generated hundreds of pseudo-bulk mixtures with varying numbers of cells and cell types in known proportions, allowing us to evaluate this combined impact on the deconvolution results. Among the methods for deconvolving bulk RNA-seq data, we included MuSiC, a method designed to infer the cell type composition of bulk data using scRNA-seq data as reference. Moreover, since most methods require an additional reference matrix containing cell-type-specific expression values, we assessed the effect of removing from the reference cell types that were actually present in the mixtures. Further in-depth analyses are currently ongoing.
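The pseudo-bulk construction described above can be illustrated with a minimal sketch. It is not the pipeline used in this work. The function name `make_pseudo_bulk`, the list-based data layout, the sampling of cells with replacement and the rounding rule are all assumptions made for illustration. The idea is only that summing the counts of cells drawn in chosen proportions yields a mixture whose true composition is known.

```python
import random
from collections import defaultdict


def make_pseudo_bulk(counts, cell_types, proportions, n_cells, seed=0):
    """Pool randomly drawn single cells into one pseudo-bulk profile.

    counts      : list of per-cell gene count vectors (hypothetical layout)
    cell_types  : cell type label of each cell, in the same order as counts
    proportions : dict {cell type: target fraction}, fractions summing to 1
    n_cells     : total number of cells pooled into the mixture
    Returns (summed gene counts, realised proportions).
    """
    rng = random.Random(seed)

    # Group cell indices by their annotated cell type.
    by_type = defaultdict(list)
    for idx, label in enumerate(cell_types):
        by_type[label].append(idx)

    # Turn target fractions into whole cell numbers; any rounding
    # remainder is assigned to the most abundant cell type (assumption).
    n_per_type = {t: round(p * n_cells) for t, p in proportions.items()}
    largest = max(proportions, key=proportions.get)
    n_per_type[largest] += n_cells - sum(n_per_type.values())

    n_genes = len(counts[0])
    bulk = [0] * n_genes
    for cell_type, n in n_per_type.items():
        if n <= 0:
            continue
        if not by_type[cell_type]:
            raise ValueError(f"no cells annotated as {cell_type!r}")
        # Assumption: cells are drawn with replacement within each type.
        for idx in rng.choices(by_type[cell_type], k=n):
            for g in range(n_genes):
                bulk[g] += counts[idx][g]

    # The realised proportions are the ground truth for this mixture.
    realised = {t: n / n_cells for t, n in n_per_type.items()}
    return bulk, realised


# Toy data: 4 cells x 3 genes, two hypothetical cell types.
counts = [[10, 0, 1], [12, 0, 2], [0, 5, 1], [0, 7, 1]]
cell_types = ["alpha", "alpha", "beta", "beta"]
bulk, truth = make_pseudo_bulk(
    counts, cell_types, {"alpha": 0.75, "beta": 0.25}, n_cells=100
)
print(bulk, truth)
```

Calling the function repeatedly with different `proportions`, `n_cells` and `seed` values gives a collection of mixtures with known composition. Estimates from a deconvolution method can then be compared against the returned proportions.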