Road extraction from aerial images has attracted considerable attention. Semantic segmentation methods applied to this task often struggle with narrow and occluded roads. In this letter, we propose a network called ConDinet++, which improves on the general encoder-decoder architecture. In the encoder, a VGG16 with pretrained parameters is used for feature extraction. In the decoder, we apply a feature fusion mechanism to the full-scale feature maps. To improve the network's ability to extract and integrate semantic information and to further enlarge the receptive field, we adopt conditional dilated convolution blocks (CDBs) in the encoder; each CDB consists of a group of cascaded conditional dilated convolutions. More importantly, the designed architecture can adjust the number of convolutions and the parameters of the convolution kernels according to the input data. Because roads are slender and occupy only a small fraction of the image, we train with a joint loss that combines the Lovász loss and the cross-entropy loss, to avoid a segmentation model that is severely biased by the highly unbalanced sizes of road and background regions. The proposed method was tested on two public datasets: the Massachusetts Roads Dataset and the Mini DeepGlobe Road Extraction Challenge dataset. Compared with several previous semantic segmentation networks, the proposed ConDinet++ achieved the best recall, F-score, and mIoU.
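To make the joint loss concrete, the following is a minimal sketch in plain Python of a binary Lovász hinge loss combined with a binary cross-entropy loss on per-pixel logits. It is an illustration under stated assumptions, not the implementation used for ConDinet++: the function names, the flat list-of-pixels input, and the mixing weight `alpha` are hypothetical, and the abstract does not state how the two terms are weighted.

```python
import math


def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss.

    gt_sorted: ground-truth labels (0 or 1) ordered by descending error.
    """
    total = sum(gt_sorted)
    grad, cum_pos, cum_neg, prev = [], 0, 0, 0.0
    for g in gt_sorted:
        cum_pos += g
        cum_neg += 1 - g
        intersection = total - cum_pos
        union = total + cum_neg  # always >= 1 for a non-empty input
        jaccard = 1.0 - intersection / union
        grad.append(jaccard - prev)  # discrete difference of the Jaccard loss
        prev = jaccard
    return grad


def lovasz_hinge(logits, labels):
    """Binary Lovasz hinge loss over a flat list of pixel logits."""
    # Hinge errors: positive when a pixel violates the unit margin.
    errors = [1.0 - z * (2 * y - 1) for z, y in zip(logits, labels)]
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    grad = lovasz_grad([labels[i] for i in order])
    return sum(max(errors[i], 0.0) * g for i, g in zip(order, grad))


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy, written in a numerically stable form."""
    terms = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, labels)
    ]
    return sum(terms) / len(terms)


def joint_loss(logits, labels, alpha=0.5):
    """Weighted sum of the two losses; alpha is an assumed, hypothetical weight."""
    return alpha * bce_with_logits(logits, labels) + (1.0 - alpha) * lovasz_hinge(
        logits, labels
    )


if __name__ == "__main__":
    # Four pixels, two of them road (label 1); values are made up for illustration.
    labels = [1, 0, 0, 1]
    print(joint_loss([5.0, -5.0, -5.0, 5.0], labels))  # confident and correct
    print(joint_loss([-5.0, 5.0, 5.0, -5.0], labels))  # confident and wrong
```

In this sketch the cross-entropy term scores each pixel independently, while the Lovász term is a surrogate for the Jaccard index over the whole image. That is why combining the two is a plausible way to counter the imbalance between thin road regions and the background.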