Learning and Understanding Deep Spatio-Temporal Representations from Free-Hand Fetal Ultrasound Sweeps

Identifying structures in nonstandard fetal ultrasound planes is a significant challenge, even for human experts, due to the high variability of the anatomies in appearance, scale and position, yet it is important for image interpretation and navigation. In this work, our contribution is three-fold: (i) we model the local temporal dynamics of video clips by applying convolutional LSTMs to the intermediate CNN layers, which learn to detect fetal structures at various scales; (ii) we propose an attention-gated LSTM, which generates spatio-temporal attention maps that reveal the intermediate process of structure localisation; and (iii) our approach is end-to-end trainable, and localisation is achieved in a weakly supervised fashion, i.e., with only image-level labels available during training. The proposed attention mechanism is found to improve detection performance substantially in terms of classification precision and localisation correctness.
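To make the attention-gated convolutional-LSTM idea concrete, the sketch below shows one possible cell in plain NumPy. It is an illustrative simplification, not the paper's implementation: the convolutions are reduced to 1x1 (per-pixel linear maps) for brevity, and all weight shapes, the sigmoid attention scorer, and the class name `AttnGatedConvLSTM` are assumptions made for this example. The key structure matches the abstract: a spatial attention map is computed from the current feature map and hidden state, the input is gated by that map, and the gated input drives standard LSTM recurrence, so the per-frame attention maps expose where the cell is looking as it localises structures.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AttnGatedConvLSTM:
    """Minimal attention-gated ConvLSTM cell (illustrative sketch).

    Convolutions are reduced to 1x1 (per-pixel linear maps) for brevity;
    a real ConvLSTM would use spatial kernels (e.g. 3x3).
    """

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        # Input->gates and hidden->gates weights for the 4 LSTM gates (i, f, o, g).
        self.Wx = rng.standard_normal((4 * hid_ch, in_ch)) * 0.1
        self.Wh = rng.standard_normal((4 * hid_ch, hid_ch)) * 0.1
        self.b = np.zeros(4 * hid_ch)
        # Attention scorer: maps [input; hidden] channels to one attention channel.
        self.Wa = rng.standard_normal((1, in_ch + hid_ch)) * 0.1
        self.hid_ch = hid_ch

    def step(self, x, h, c):
        """One recurrent step. x: (C, H, W) frame features; h, c: (Hc, H, W) state."""
        # Spatial attention map in (0, 1), computed from input and hidden state.
        z = np.concatenate([x, h], axis=0)                 # (C + Hc, H, W)
        a = sigmoid(np.einsum('oc,chw->ohw', self.Wa, z))  # (1, H, W)
        xg = a * x                                         # attention-gated input
        # Standard LSTM gate computation on the gated input (1x1 "convs").
        gates = (np.einsum('oc,chw->ohw', self.Wx, xg)
                 + np.einsum('oc,chw->ohw', self.Wh, h)
                 + self.b[:, None, None])
        i, f, o, g = np.split(gates, 4, axis=0)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c, a

# Usage: run the cell over a short clip and collect the attention maps.
cell = AttnGatedConvLSTM(in_ch=3, hid_ch=4)
h = np.zeros((4, 5, 5))
c = np.zeros((4, 5, 5))
clip = np.random.default_rng(1).standard_normal((6, 3, 5, 5))  # (T, C, H, W)
attn_maps = []
for x in clip:
    h, c, a = cell.step(x, h, c)
    attn_maps.append(a)
```

In a weakly supervised setting such as the one described above, the final hidden state would feed a classifier trained with image-level labels only, while the per-frame maps in `attn_maps` provide the spatio-temporal localisation signal as a by-product.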