STAPLE performance assessed on crowdsourced sclera segmentations

The Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm is frequently used in medical image segmentation when no ground truth (GT) is available. In this paper, we investigate how many inexperienced users are required to establish a reliable STAPLE-based GT and how many vertices the users should place for a point-based segmentation. We employ “WeLineation”, a novel web-based system for crowdsourcing segmentations. Within the study, 44 users delivered 2,060 masks on 75 different photographic images of the human eye, in which they had to segment the sclera. The GT was estimated by applying STAPLE to all masks. STAPLE was then recomputed using fewer user contributions, and the results were compared to the GT. Requiring an error rate lower than 2%, the same segmentation performance is obtained with 13 experienced and 22 rather inexperienced users. More than 10 vertices should be placed on the delineation contour in order to reach an accuracy larger than 95%. On average, a vertex should be placed every 81 pixels along the segmentation contour. The results indicate that knowledge about the users' performance can reduce the number of segmentation masks per image that are needed to estimate a reliable GT. Therefore, we recommend gathering performance parameters of users during a crowdsourcing study and applying this information to the assignment process. In this way, the cost-effectiveness of a crowdsourcing segmentation study can be improved.
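
To illustrate the estimation step described above, the following is a minimal sketch of binary STAPLE as an expectation-maximization loop. It is not the implementation used in the study. The function names `staple_binary` and `pixel_accuracy`, the initial sensitivity and specificity of 0.9, the fixed iteration count, and the use of the mean foreground fraction as prior are all assumptions made for illustration.

```python
import math


def staple_binary(decisions, prior=None, iters=50, eps=1e-9):
    """Binary STAPLE sketch (hypothetical helper, not the study's code).

    decisions: list of raters, each a flat list of 0/1 labels per pixel.
    Returns (w, p, q): per-pixel foreground probability, and per-rater
    sensitivity p and specificity q.
    """
    n_raters = len(decisions)
    n_pixels = len(decisions[0])
    if prior is None:
        # Assumption: prior = mean foreground fraction over all masks.
        prior = sum(sum(d) for d in decisions) / (n_raters * n_pixels)
    p = [0.9] * n_raters  # assumed initial sensitivities
    q = [0.9] * n_raters  # assumed initial specificities
    w = [0.0] * n_pixels
    for _ in range(iters):
        # E-step: posterior probability that each pixel is foreground.
        for i in range(n_pixels):
            log_a = math.log(prior + eps)
            log_b = math.log(1.0 - prior + eps)
            for j in range(n_raters):
                if decisions[j][i]:
                    log_a += math.log(p[j] + eps)
                    log_b += math.log(1.0 - q[j] + eps)
                else:
                    log_a += math.log(1.0 - p[j] + eps)
                    log_b += math.log(q[j] + eps)
            # Clamp the log-odds so math.exp cannot overflow.
            diff = max(min(log_b - log_a, 700.0), -700.0)
            w[i] = 1.0 / (1.0 + math.exp(diff))
        # M-step: re-estimate each rater's performance parameters.
        sum_w = sum(w)
        sum_nw = n_pixels - sum_w
        for j in range(n_raters):
            d = decisions[j]
            p[j] = sum(w[i] * d[i] for i in range(n_pixels)) / (sum_w + eps)
            q[j] = sum((1.0 - w[i]) * (1 - d[i])
                       for i in range(n_pixels)) / (sum_nw + eps)
    return w, p, q


def pixel_accuracy(mask, reference):
    """Fraction of pixels on which two binary masks agree."""
    agree = sum(1 for m, r in zip(mask, reference) if m == r)
    return agree / len(reference)


if __name__ == "__main__":
    # Toy data: two consistent raters and one who misses a foreground pixel.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    raters = [truth[:], truth[:], [1, 1, 1, 0, 0, 0, 0, 0]]
    w, p, q = staple_binary(raters)
    consensus = [int(x > 0.5) for x in w]
    print(consensus, pixel_accuracy(consensus, truth))
```

Under these assumptions, the comparison described in the abstract could be mimicked by running `staple_binary` on a subset of the masks, thresholding the result at 0.5, and scoring it with `pixel_accuracy` against the consensus obtained from all masks.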