A Preliminary Analysis of Self-Supervision for Wireless Capsule Endoscopy

Human learning relies on both supervised and unsupervised tasks to grasp visual semantics in general. Is it then possible to learn meaningful features from data without annotations, or even without knowledge of the number of inherent classes within a dataset? This problem remains open and is under heavy research today. However, most of the effort toward answering it lies in the domain of natural images. Despite impressive progress, state-of-the-art methods from that domain rarely, if ever, transfer directly to other domains. In this paper, we veer off from natural images to investigate self-supervised learning in a challenging medical domain: wireless capsule endoscopy. We implement a self-supervision pipeline, adapting two different pretext tasks for learning representations, and evaluate the utility of the self-supervised features for clinical diagnosis. We further infer that a gap exists between the actual requirements and the resulting characteristics of features trained under inadequately adapted "self-supervision"; this gap is more pronounced in medical domains, and we discuss the factors that influence it.