Detecting and understanding interactions between students and teachers in the classroom is an important requirement for computer vision-based educational assistive systems. Recently, deep techniques for modeling long-range spatial dependencies, such as non-local networks, have proven very effective for such tasks. However, when generating global context, the non-local operation compares pixels only by their values and therefore cannot capture structural information. In this paper, we first extend the non-local module to incorporate locality attributes. We further observe that every query is treated uniformly when the attention map is generated. Hence, we incorporate distance-wise representations, with an efficient implementation, into the non-local formulation. The proposed locality- and relative distance-aware non-local module is integrated into an object detection architecture, namely Libra-RCNN, and is evaluated on a pre-access hand-raising gesture dataset. Our straightforward modification improves on the baseline Libra-RCNN model by 0.8% and 2.8% in terms of mAP 0.5 and mAP 0.75, respectively.
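As a rough illustration of the value-only comparison described above, the following minimal sketch computes non-local attention over a handful of pixel feature vectors and optionally subtracts a distance penalty from each pairwise score. It is not the module proposed in this paper: the function names `attention_weights` and `non_local`, the parameter `distance_weight`, and the linear Euclidean-distance penalty are all hypothetical choices made for the example, and the learned embeddings of a real non-local block are replaced by identity mappings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, positions=None, distance_weight=0.0):
    """Return an N x N attention matrix for N pixel feature vectors.

    Without positions, each score is a plain dot product, so the
    comparison depends only on feature values. With positions, a
    hypothetical linear distance penalty is subtracted (an assumption
    for illustration, not the paper's formulation).
    """
    weights = []
    for i, query in enumerate(features):
        scores = []
        for j, key in enumerate(features):
            score = sum(q * k for q, k in zip(query, key))
            if positions is not None:
                d_row = positions[i][0] - positions[j][0]
                d_col = positions[i][1] - positions[j][1]
                score -= distance_weight * math.hypot(d_row, d_col)
            scores.append(score)
        weights.append(softmax(scores))
    return weights


def non_local(features, positions=None, distance_weight=0.0):
    """Aggregate features with the attention weights and add a residual."""
    weights = attention_weights(features, positions, distance_weight)
    outputs = []
    for query, row in zip(features, weights):
        aggregated = [
            sum(w * f[d] for w, f in zip(row, features))
            for d in range(len(query))
        ]
        outputs.append([x + a for x, a in zip(query, aggregated)])
    return outputs


# Three pixels with identical features but different (row, col) positions.
features = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
positions = [(0, 0), (0, 1), (0, 5)]

plain = attention_weights(features)
biased = attention_weights(features, positions, distance_weight=1.0)
```

In this toy setting the plain attention row for the first pixel is uniform, because identical feature values are indistinguishable to a value-only comparison, whereas the distance-penalized row assigns more weight to nearer pixels. This is the kind of behavior a relative distance-aware formulation is intended to provide, under the simplifying assumptions stated above.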