SelfE: Gene Selection via Self-Expression for Single-Cell Data

Single-cell RNA-sequencing has been proved to be advantageous in discerning molecular heterogeneity in seemingly similar cells in a tissue. Due to the paucity of starting RNA, a large fraction of transcripts fail to amplify during the polymerase chain reaction cycle. This gets compounded by trivial biological noise such as variability in the cell cycle specific genes. As a result expression matrix obtained from a single-cell study is highly sparse with a large number of missing values. This hinders the downstream analysis of single-cell expression data. It has been observed that feature engineering significantly improves the analysis outcomes. Feature extraction methods such as principal component analysis and zero-inflated factor analysis have been shown to be useful for subsequent steps for data analysis including clustering. However, too little or no visible efforts have been observed for developing feature selection techniques, which offer transparency for the analyst's consumption. We propose SelfE, a novel L2,0-minimization algorithm that determines an optimal subset of feature vectors that preserves sub-space structures as observed in the data. We compared SelfE with the commonly used feature selection methods for single-cell expression data analysis.