Approximate computing is a promising approach to improving energy efficiency for IoT devices on the edge. This work proposes an optimally approximated, unbiased floating-point multiplier with runtime configurability. We provide a theoretically sound formulation that turns multiplication approximation into an optimization problem. Building on this formulation and its findings, we propose a multilevel architecture that readily incorporates runtime configurability and parallel module execution. Finally, an optimization scheme is applied to reduce the area, making it linearly dependent on the precision rather than quadratically or exponentially dependent as in prior work. In addition to optimal approximation and configurability, the proposed design has an efficient circuit implementation that uses inversion, shift, and addition instead of complex arithmetic operations. Compared with the prior state-of-the-art approximate floating-point multiplier, ApproxLP [30], the proposed design is better in accuracy, area, and delay. When it replaces the regular full-precision multiplier in a GPU, the proposed design improves energy efficiency for various edge computing tasks. Even with Level 1 approximation, the proposed design improves energy efficiency by up to 122× for machine learning on CIFAR-10, with almost negligible accuracy loss.
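To make the idea of an optimal, unbiased approximation built from shifts and additions concrete, the following is a minimal sketch and not the proposed design itself. It assumes mantissa fractions x and y uniformly distributed in [0, 1) and replaces the cross term xy in (1 + x)(1 + y) with its least-squares linear fit x/2 + y/2 - 1/4, whose residual (x - 1/2)(y - 1/2) has zero mean. The names `approx_mul` and `mean_error` are hypothetical, and the sketch handles finite inputs only.

```python
import math


def approx_mul(a: float, b: float) -> float:
    """Illustrative approximate product of two finite floats (hypothetical helper)."""
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    # frexp returns m in [0.5, 1) with |a| = m * 2**e, so |a| = (1 + x) * 2**(e - 1).
    ma, ea = math.frexp(abs(a))
    mb, eb = math.frexp(abs(b))
    x = 2.0 * ma - 1.0  # mantissa fraction in [0, 1)
    y = 2.0 * mb - 1.0
    s = x + y
    # (1 + x)(1 + y) = 1 + x + y + xy, with xy replaced by x/2 + y/2 - 1/4.
    # In hardware, 0.5 * s would be a one-bit right shift and the rest additions.
    mant = 0.75 + s + 0.5 * s
    return sign * math.ldexp(mant, (ea - 1) + (eb - 1))


def mean_error(n: int = 64) -> float:
    """Mean signed error over an n-by-n midpoint grid of mantissa fractions."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            a = 1.0 + (i + 0.5) / n
            b = 1.0 + (j + 0.5) / n
            total += approx_mul(a, b) - a * b
    return total / (n * n)


if __name__ == "__main__":
    print(approx_mul(1.5, 1.5), 1.5 * 1.5)  # exact at the centre of the mantissa square
    print(approx_mul(1.0, 1.0), 1.0 * 1.0)  # worst case of this single-plane fit
    print(mean_error())                     # close to zero: the fit is unbiased
```

A single plane is exact at the centre of the mantissa square and least accurate at its corners. Splitting the square into smaller regions with one plane each is one natural way to trade hardware for accuracy; whether this matches the levels of the proposed architecture is an assumption of this sketch.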