Purpose: Cone-beam CT (CBCT) is a common approach to 3D guidance in interventional radiology (IR), but its long image acquisition time limits image quality in the presence of complex deformable motion of soft-tissue structures. Multi-region autofocus optimization has shown successful compensation of deformable motion with gains in image quality. However, the use of suboptimal conventional autofocus metrics, combined with the high dimensionality and non-convexity of the optimization problem, hinders convergence. This work presents a learning-based approach to the design of autofocus metrics tailored to the specific problem of motion compensation in soft-tissue CBCT.

Methods: A deep convolutional neural network (CNN) was used to automatically extract image features quantifying the local motion amplitude and principal direction in volumetric regions of interest (ROIs, 128 × 128 × 128 voxels) of a motion-contaminated volume. The estimated motion amplitude was then used as the core component of the cost function of the deformable autofocus method, complemented by a regularization term favoring similar motion for regions close in space. The network consists of a Siamese arrangement of three branches acting on the three central orthogonal planes of the volumetric ROI. The network was trained with simulated data generated from publicly available CT datasets, including deformable motion fields from which volumetric ROIs with locally rigid motion were extracted. The performance of motion amplitude estimation and of the final CNN-based deformable autofocus was assessed on synthetic CBCT data generated similarly to the training dataset and featuring deformable motion fields with 1 to 5 components of random direction and random amplitude ranging from 0 mm to 50 mm.

Results: The predicted local motion amplitude showed good agreement with the true values, with a linear relationship (R² = 0.96, slope = 0.95) and a slight underestimation of the motion amplitude. Absolute errors in total motion amplitude and in individual components remained below 2 mm throughout the explored range. Relative errors were higher for low-amplitude motion, pointing to the need for a larger training cohort focused on low-motion-amplitude scenarios. Motion compensation with the learning-based metric successfully removed motion artifacts in single-ROI environments, with a 40% reduction in median RMSE relative to the static, motion-free reference image. Deformable motion compensation resulted in better visualization of soft-tissue structures and overall sharper image details, with slight residual motion artifacts in regions combining moderate motion amplitude with complex anatomical structures.

Conclusion: Deformable motion compensation with automatically learned autofocus metrics was shown to be feasible, opening the way to the design of metrics with the potential for more reliable performance and easier optimization than conventional autofocus metrics.
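To make the network design described in the Methods concrete, the following is a minimal sketch, assuming PyTorch; the class names, layer sizes, and channel counts are illustrative assumptions, not the authors' architecture. It shows the Siamese arrangement: a single shared 2D backbone applied to the three central orthogonal planes of a 128 × 128 × 128 ROI, with a head predicting a scalar motion amplitude and a unit principal-direction vector.

```python
# Hypothetical sketch of the Siamese three-branch CNN from the abstract.
# Assumptions: PyTorch; all layer sizes and channel counts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlaneBackbone(nn.Module):
    """Shared 2D feature extractor applied to each central orthogonal plane."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> one 64-vector per plane
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (B, 64)


class SiameseAutofocusNet(nn.Module):
    """Predicts local motion amplitude (mm) and principal direction from the
    three central orthogonal planes of a volumetric ROI (shared weights)."""
    def __init__(self):
        super().__init__()
        self.backbone = PlaneBackbone()       # weights shared across the 3 planes
        self.head = nn.Linear(3 * 64, 4)      # [amplitude, direction (3-vector)]

    def forward(self, roi):                   # roi: (B, 1, 128, 128, 128)
        c = roi.shape[-1] // 2
        planes = [roi[:, :, c, :, :],         # central plane, first volume axis
                  roi[:, :, :, c, :],         # central plane, second volume axis
                  roi[:, :, :, :, c]]         # central plane, third volume axis
        feats = torch.cat([self.backbone(p) for p in planes], dim=1)
        out = self.head(feats)
        amplitude = out[:, :1]                # scalar motion amplitude per ROI
        direction = F.normalize(out[:, 1:], dim=1)  # unit principal direction
        return amplitude, direction
```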
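The abstract also states that the predicted amplitude forms the core of the autofocus cost, complemented by a regularizer favoring similar motion for ROIs close in space. The sketch below illustrates one plausible form of such a cost; the distance weighting kernel and the weight `lam` are assumptions for illustration only.

```python
# Hypothetical multi-ROI autofocus cost: CNN-predicted amplitude per ROI plus
# a spatial-smoothness regularizer. Kernel and lam are illustrative assumptions.
import torch


def autofocus_cost(amplitudes: torch.Tensor,   # (N,) predicted amplitude per ROI
                   positions: torch.Tensor,    # (N, 3) ROI center coordinates
                   lam: float = 0.1) -> torch.Tensor:
    data_term = amplitudes.sum()               # favor low estimated motion overall
    d = torch.cdist(positions, positions)      # pairwise ROI-center distances
    w = torch.exp(-d / d.mean())               # nearby ROIs are weighted more
    diff = (amplitudes[:, None] - amplitudes[None, :]) ** 2
    reg = (w * diff).sum() / 2                 # penalize dissimilar nearby motion
    return data_term + lam * reg
```

In this reading, the optimizer adjusts the per-region motion trajectories, re-evaluates the CNN on the recompensated ROIs, and minimizes this cost; the regularizer discourages physically implausible, spatially discontinuous motion estimates.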