With the growing popularity of civilian unmanned aerial vehicles (UAVs), unauthorized flights are rising accordingly. It is therefore critical to detect low-altitude UAVs to protect personal privacy and public safety. Although substantial progress has been made in UAV detection, existing methods still struggle to balance detection accuracy, model size, and detection speed. To address these limitations, this article proposes a novel deep learning method named the convolution–transformer network (CT-Net). First, an attention-enhanced transformer block (AETB), which builds a feature-enhanced multihead self-attention (FEMSA), is introduced into the backbone of the network to improve the feature extraction ability of the model. Then, a lightweight bottleneck module (LBM) is used to control the computational load and reduce the number of parameters. Finally, we present a directional feature fusion structure (DFFS) to improve detection accuracy on multiscale objects, especially small objects. The proposed scheme achieves 0.966 mAP with an input size of $640 \times 640$ pixels on our low-altitude small-object dataset, outperforming YOLOv5. Furthermore, experimental results on MS COCO show that CT-Net outperforms current state-of-the-art detectors by a large margin. These results indicate that CT-Net is well suited to low-altitude small-object detection.
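To make the hybrid design concrete, the following is a minimal sketch (not the authors' code) of an AETB-style block: convolutional features are flattened into spatial tokens, passed through standard multi-head self-attention, and reshaped back into a feature map. The specific "feature-enhanced" mechanism of FEMSA is not described in the abstract, so plain multi-head self-attention is used here; the class name, channel width, and head count are illustrative assumptions.

```python
# Hypothetical sketch of an attention-enhanced transformer block over conv features.
# Uses standard multi-head self-attention; the paper's FEMSA presumably adds a
# feature-enhancement step that is not specified in the abstract.
import torch
import torch.nn as nn


class AttentionEnhancedBlock(nn.Module):
    """Conv feature map -> spatial tokens -> self-attention + MLP -> feature map."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(channels),
            nn.Linear(channels, 4 * channels),
            nn.GELU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the convolutional backbone.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C) spatial tokens
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)           # self-attention over spatial positions
        tokens = tokens + attn_out                 # residual connection
        tokens = tokens + self.mlp(tokens)         # position-wise feed-forward
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = AttentionEnhancedBlock(channels=256, num_heads=8)
    feat = torch.randn(2, 256, 20, 20)             # e.g. a 640x640 input downsampled 32x
    print(block(feat).shape)                       # torch.Size([2, 256, 20, 20])
```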