Prediction of protein stability changes caused by mutation is of major importance for protein engineering and for understanding protein misfolding diseases and protein evolution. The major limitation to these applications is that different prediction methods vary substantially in performance on specific proteins, i.e. performance does not transfer from one type of mutation or protein to another. In this study, we investigated the performance and transferability of eight widely used methods. We first constructed a new dataset of 2647 mutations using strict selection criteria for the experimental data, and then defined a variety of sub-datasets that are unbiased with respect to mutation type, extent of stabilization, structure type and solvent exposure. Benchmarking the methods against these sub-datasets enabled us to systematically investigate how dataset biases affect predictor performance. In particular, we used a reduced amino acid alphabet to quantify the bias towards mutation type, which we identified as the major bias in current approaches. Our results show that all prediction methods exhibit large biases, stemming not from failures of the models applied, but mostly from selection biases in the experimental data used for training or parametrization. Our identification of these biases and the construction of a new mutation-type-balanced dataset should lead to the development of more balanced and transferable prediction methods in the future.
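To illustrate the idea of quantifying mutation-type bias with a reduced amino acid alphabet, the following minimal Python sketch groups the 20 standard residues into broad physicochemical classes and tallies mutations by their (wild-type class, mutant class) pair. The grouping shown here is purely illustrative; the specific reduced alphabet and counting procedure used in this study may differ.

```python
from collections import Counter

# Illustrative reduced amino acid alphabet: residues grouped by broad
# physicochemical class (assumption; not necessarily the grouping used here).
REDUCED_ALPHABET = {
    "hydrophobic": set("AVLIMFWC"),
    "polar":       set("STNQYGP"),
    "charged":     set("DEKRH"),
}

def reduce_residue(residue: str) -> str:
    """Map a one-letter amino acid code to its reduced-alphabet class."""
    for group, members in REDUCED_ALPHABET.items():
        if residue.upper() in members:
            return group
    raise ValueError(f"Unknown residue: {residue}")

def mutation_type_counts(mutations):
    """Count mutations per (wild-type class -> mutant class) pair.

    `mutations` is an iterable of (wild_type, mutant) one-letter codes,
    e.g. [("A", "D"), ("L", "V"), ...].
    """
    return Counter((reduce_residue(wt), reduce_residue(mut)) for wt, mut in mutations)

# Toy example: a skewed dataset in which most entries mutate hydrophobic
# residues, mirroring the kind of mutation-type bias discussed above.
dataset = [("L", "A"), ("V", "A"), ("I", "G"), ("F", "L"), ("D", "K")]
for (wt_class, mut_class), n in mutation_type_counts(dataset).items():
    print(f"{wt_class:>11} -> {mut_class:<11} {n}")
```

A balanced sub-dataset would show roughly uniform counts across these class pairs, whereas a biased one concentrates mutations in a few cells of the class-pair table.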