Reply to ‘Inflated performance measures in enhancer–promoter interaction-prediction methods’

Cao and Yip reply — Cao and Fullwood (ref. 1) argue that the reported performance of JEME (ref. 2) was inflated by positive pairs of enhancers and transcription start sites (TSSs) whose window regions overlap between the training and testing sets. They also show that our random targets approach to generating negative pairs produced very different distance distributions for the positive and negative pairs (Fig. 2g in ref. 1), making it trivial to separate the two classes on the basis of genomic distance features alone.

Random targets is only one of the four approaches that we used to generate negative pairs. The resulting dataset is realistic, because enhancers do tend to regulate nearby TSSs. For the specific goal of testing whether JEME can distinguish between positive and negative pairs at similar distances, we also designed the random contacts approach to generating negative pairs. In this setting, the positive and negative pairs have more similar, though not identical, distance distributions (Fig. 1a,b), and, as reported in our original paper, JEME performed substantially better than several baseline methods (Fig. 3c in ref. 2). In addition, the dataset used for comparing JEME with Ripple (ref. 3), which was not based on the random targets approach, has almost identical distance distributions for the positive and negative pairs. Evaluated on this dataset, JEME still achieved a reasonable area under the precision–recall curve (AUPR) of 0.6, comparable to that of Ripple, a method specifically designed for this setting.

Regarding the overlap of positive examples between the training and testing sets: because the main use of JEME is ultimately to predict enhancer–target interactions in unseen samples, the across-sample tests are the most informative.
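The contrast between the two negative-sampling strategies can be made concrete with a minimal sketch. The function names and the position-based representation below are illustrative only (this is not the published JEME code): one sampler pairs enhancers with arbitrary non-target TSSs, so negatives skew towards large distances, while a distance-matched sampler in the spirit of the random contacts approach draws negatives whose distances mimic the positive distribution, so distance alone cannot separate the classes.

```python
import random

def random_target_negatives(enhancers, tss_list, positives, n):
    """Pair enhancers with randomly chosen non-target TSSs.

    Because TSSs are drawn genome-wide, the resulting negatives tend
    to lie much farther from the enhancer than true target TSSs do.
    """
    negatives = []
    while len(negatives) < n:
        e = random.choice(enhancers)
        t = random.choice(tss_list)
        if (e, t) not in positives:
            negatives.append((e, t))
    return negatives

def distance_matched_negatives(enhancers, tss_list, positives, n):
    """Draw negatives whose enhancer-TSS distances mimic the positives.

    For each negative, sample a distance from the positive pairs and
    pick the TSS whose distance from the enhancer is closest to it.
    """
    pos_distances = [abs(e - t) for e, t in positives]
    negatives = []
    while len(negatives) < n:
        e = random.choice(enhancers)
        d = random.choice(pos_distances)  # a positive-like distance
        t = min(tss_list, key=lambda x: abs(abs(e - x) - d))
        if (e, t) not in positives:
            negatives.append((e, t))
    return negatives
```

Under the first sampler, a classifier can score well by thresholding distance alone; under the second, it must rely on other features, which is the harder test described above.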
In these tests, all information from the training sample can be used to predict the enhancer–target pairs in the testing sample, so overlapping positive examples are not a concern. Cao and Fullwood report that in the across-sample test with K562 as the training sample and GM12878 as the testing sample, JEME was less accurate than a simplified version of JEME that uses only genomic distance features. They obtained this result by tuning the maximum tree depth parameter of the random-forest model, whereas we used the default setting without any parameter tuning in our original paper; this difference in settings explains the discrepancy. For GM12878, only expression quantitative trait locus (eQTL) data were available for validating the predicted enhancer–target pairs. This scenario is not ideal, because eQTLs can include indirect interactions and can miss true enhancer–target interactions that have no eQTLs at the enhancers. We included
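The point about the distance-only baseline can be illustrated with a toy sketch. Assuming (hypothetically, for illustration only) that negatives lie systematically farther from enhancers than positives, even the shallowest possible tree, a depth-1 "stump" that thresholds the single distance feature, can separate the classes; the `best_stump` helper below is invented for this sketch and is not part of JEME or the reanalysis:

```python
def best_stump(distances, labels):
    """Depth-1 decision tree ('stump') on a single distance feature.

    Scans candidate thresholds and returns (accuracy, threshold) for
    the rule "predict positive when distance <= threshold", reflecting
    the tendency of enhancers to act on nearby TSSs.
    """
    best = (0.0, None)  # (accuracy, threshold)
    for t in sorted(set(distances)):
        acc = sum((d <= t) == bool(y)
                  for d, y in zip(distances, labels)) / len(labels)
        if acc > best[0]:
            best = (acc, t)
    return best
```

When the two classes have matched distance distributions, as in the random contacts setting, no such threshold exists and the model must exploit other features, which is why the choice of negative set and of model capacity (for example, maximum tree depth) both shape the measured performance.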