A Semi-Decoupled Approach to Fast and Optimal Hardware-Software Co-Design of Neural Accelerators