Building footprint segmentation from high-resolution remote sensing (RS) images plays a vital role in urban planning, disaster response, and population density estimation. Convolutional neural networks (CNNs) have recently become the workhorse for generating building footprints effectively. However, fully exploiting the predictive power of CNNs requires large-scale pixel-level annotations. Most state-of-the-art CNN-based methods focus on designing network architectures that improve building footprint predictions given full annotations, while little work has addressed building footprint segmentation with limited annotations. In this article, we propose a novel semisupervised learning method for building footprint segmentation, which can effectively predict building footprints with a network trained on few annotations (e.g., only $0.0324\,\mathrm{km}^2$ out of a $2.25\,\mathrm{km}^2$ area is labeled). The proposed method investigates the contrast between building and background pixels in latent space and the consistency of predictions obtained from the CNN models when the input RS images are perturbed. We therefore term the proposed semisupervised learning framework PiCoCo, as it enforces Pixelwise Contrast and Consistency during the learning phase. Our experiments, conducted on two benchmark building segmentation datasets, validate the effectiveness of the proposed framework compared to several state-of-the-art building footprint extraction and semisupervised semantic segmentation methods.
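
To make the two learning signals concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: `model`, `perturb`, the feature shapes, and the prototype-based form of the contrastive term are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, images, perturb):
    # Consistency under perturbation: predictions on the clean image serve
    # as a detached target for predictions on a perturbed view of the same
    # image (`model` and `perturb` are hypothetical placeholders).
    probs_clean = torch.sigmoid(model(images)).detach()
    probs_pert = torch.sigmoid(model(perturb(images)))
    return F.mse_loss(probs_pert, probs_clean)

def pixel_contrast_loss(features, mask, temperature=0.1):
    # features: (B, D, H, W) latent embeddings; mask: (B, 1, H, W) with
    # 1 = building, 0 = background (ground truth or pseudo-labels).
    feat = F.normalize(features, dim=1)
    m = mask.float()
    # Class prototypes: mean embedding over building / background pixels.
    fg = F.normalize((feat * m).sum(dim=(0, 2, 3)) / m.sum().clamp(min=1.0), dim=0)
    bg = F.normalize((feat * (1 - m)).sum(dim=(0, 2, 3)) / (1 - m).sum().clamp(min=1.0), dim=0)
    protos = torch.stack([bg, fg])  # (2, D): index 0 = background, 1 = building
    # Pixel-vs-prototype similarities act as two-way classification logits,
    # pulling each pixel toward its own class prototype and pushing it away
    # from the other class's prototype.
    logits = torch.einsum('bdhw,kd->bkhw', feat, protos) / temperature
    return F.cross_entropy(logits, m.squeeze(1).long())
```

On unlabeled images, the two terms could be combined as, e.g., `loss = consistency + lambda_c * contrast`, with pseudo-labels standing in for the mask; the weighting and pseudo-labeling scheme here are hypothetical, and the paper itself should be consulted for the exact formulation.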