Parameters tuning boosts hyperSMURF predictions of rare deleterious non-coding genetic variants

The regulatory code that determines whether and how a given genetic variant affects the function of a regulatory element remains poorly understood for most classes of regulatory variation. Indeed, the large majority of bioinformatics tools have been developed to predict the pathogenicity of genetic variants in coding sequences or conserved splice sites. Computational algorithms for predicting non-coding deleterious variants associated with rare genetic diseases face special challenges owing to the rarity of confirmed pathogenic mutations. In this context, classical machine learning methods are biased toward neutral variants, which constitute the large majority of genetic variation, and fail to detect the potentially deleterious variants, which constitute only a tiny minority of all known genetic variation. We recently proposed hyperSMURF (hyper-ensemble of SMOTE Undersampled Random Forests), an ensemble approach explicitly designed to deal with the huge imbalance between deleterious and neutral variants, which significantly outperforms state-of-the-art methods for the prediction of non-coding variants associated with Mendelian diseases. Despite its successful application to the detection of deleterious single nucleotide variants (SNVs) as well as small insertions or deletions (indels), hyperSMURF depends on several learning parameters that strongly influence its overall performance. In this work we show experimentally that tuning the hyperSMURF parameters boosts the performance of the method, yielding significantly better precision and recall in the prediction of rare SNVs associated with Mendelian diseases.
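
To make concrete the kind of learning parameters on which a SMOTE-plus-undersampling ensemble depends, the following is a minimal sketch of the resampling step only, written in plain Python. It is an illustration under stated assumptions and not the reference hyperSMURF implementation: the function names and the parameters `n_partitions`, `factor`, `ratio` and `k` are hypothetical, the rule that the majority class is undersampled relative to the oversampled minority class is assumed, and the training of one random forest per balanced set (with averaging of their scores) is left out.

```python
import random


def smote_oversample(minority, factor, k, rng):
    """Create `factor` synthetic points per minority example by interpolating
    toward one of its k nearest minority neighbours (SMOTE-style)."""
    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority examples by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbours = others[:k]
        for _ in range(factor):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def balanced_training_sets(minority, majority, n_partitions, factor, ratio,
                           k=5, seed=0):
    """Return one (positives, negatives) pair per majority-class partition.

    Assumed meaning of the hypothetical parameters:
      n_partitions: number of disjoint partitions of the majority class
      factor:       synthetic examples generated per minority example
      ratio:        negatives kept per positive after oversampling
      k:            nearest neighbours used for the interpolation
    Requires at least two minority examples.
    """
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Disjoint partitions: every majority example lands in exactly one of them.
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
    training_sets = []
    for part in partitions:
        # All original minority examples plus their synthetic neighbours.
        positives = minority + smote_oversample(minority, factor, k, rng)
        # Undersample this partition, capped at the partition size.
        n_neg = min(len(part), ratio * len(positives))
        negatives = rng.sample(part, n_neg)
        training_sets.append((positives, negatives))
    return training_sets


# Toy data: 3 "deleterious" and 1000 "neutral" two-feature examples.
minority = [[0.9, 0.8], [1.0, 0.7], [0.8, 1.0]]
majority = [[i / 1000, (i * 7 % 1000) / 1000] for i in range(1000)]
sets = balanced_training_sets(minority, majority,
                              n_partitions=10, factor=2, ratio=3, k=2)
print(len(sets), len(sets[0][0]), len(sets[0][1]))
```

In this sketch each of the resulting training sets is far less imbalanced than the original data, and the degree of rebalancing is controlled entirely by the chosen parameter values, which is why their setting can be expected to affect precision and recall.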