With the rise of IoT and edge computing, deploying neural networks (NNs) on low-power edge devices is attracting increasing attention. In NNs, convolutional layers account for the majority of compute cycles, especially when NNs run on ARM processors, so optimizing the convolution implementation on ARM Cortex-M MCUs is essential. This paper proposes an efficient im2row-based fast convolution algorithm with two innovations. First, a novel im2row method that reuses the data of adjacent convolutional windows is presented. This method employs a reusable im2row buffer, significantly reducing the amount of data copied during im2row and improving efficiency. Second, in the algorithm implementation, a $q7\_t$ to $q15\_t$ data type extension technique that avoids data reordering is employed. This technique eliminates data reordering instructions, thereby reducing the runtime of the algorithm. We evaluate our algorithm on individual convolutional layers and on complete NNs. The results for convolutional layers show that, compared to the baseline, the proposed algorithm speeds up the convolutional layer by an average of $1.42\times$, with a maximum speedup of $2.9\times$. Experiments on different NNs demonstrate that our algorithm speeds up the overall NN by up to $2.15\times$.
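To make the first idea concrete, the following is a minimal C sketch of window-to-window data reuse during im2row, not the paper's actual code: it assumes a single input channel, valid padding, and a column-major window buffer, and the function and parameter names (`im2row_window_reuse`, `consume`, `row0`) are our own. When the window slides right by `stride < k`, the `k - stride` overlapping columns are kept with one `memmove`, so only `stride` new columns are copied from the image instead of the full `k*k` window.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of im2row with a reusable window buffer.
 * img:    row-major single-channel image, img_w pixels per row
 * k:      kernel size, stride: horizontal stride, out_w: output width
 * row0:   top image row of this output row's windows
 * buf:    k*k bytes holding the current window, column-major
 * consume: callback that uses the materialized window (e.g. a dot product) */
void im2row_window_reuse(const int8_t *img, int img_w,
                         int k, int stride, int out_w, int row0,
                         int8_t *buf,
                         void (*consume)(const int8_t *win, int out_x))
{
    /* First window: all k*k bytes must be copied. */
    for (int c = 0; c < k; c++)
        for (int r = 0; r < k; r++)
            buf[c * k + r] = img[(row0 + r) * img_w + c];
    consume(buf, 0);

    for (int x = 1; x < out_w; x++) {
        /* Columns shared with the previous window. */
        int keep = (stride < k) ? (k - stride) : 0;
        if (keep > 0)
            /* Slide the shared columns to the front of the buffer. */
            memmove(buf, buf + (k - keep) * k, (size_t)keep * k);
        /* Copy only the new columns from the image. */
        for (int c = keep; c < k; c++)
            for (int r = 0; r < k; r++)
                buf[c * k + r] = img[(row0 + r) * img_w + x * stride + c];
        consume(buf, x);
    }
}
```

Per window, copy traffic drops from `k*k` bytes to roughly `stride*k` bytes, which is the source of the reduced im2row cost the abstract describes.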
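For the second idea, one way a reordering-free q7-to-q15 extension can be realized on Cortex-M cores with the DSP extension is sketched below; this is our hedged illustration of the general trick, not the paper's API. `__SXTB16` sign-extends bytes {0,2} of a packed word into two q15 halfwords, and `__SXTB16(__ROR(x, 8))` extends bytes {1,3}. Extending both multiplicands the same way leaves them in the same interleaved order, so the dual MAC `__SMLAD` computes a correct 4-element dot product without any `PKHBT`/`PKHTB` repacking instructions. The function name `dot_q7x4_no_reorder` is hypothetical; the intrinsics are standard CMSIS-Core intrinsics.

```c
#include <stdint.h>
#include "arm_math.h"  /* pulls in CMSIS-Core intrinsics on Cortex-M with DSP */

/* Accumulate the dot product of four q7 inputs and four q7 weights,
 * each packed into one 32-bit word, without reordering instructions. */
static inline int32_t dot_q7x4_no_reorder(uint32_t in, uint32_t wt, int32_t acc)
{
    uint32_t in_even = __SXTB16(in);            /* input bytes 0,2 -> q15 */
    uint32_t in_odd  = __SXTB16(__ROR(in, 8));  /* input bytes 1,3 -> q15 */
    uint32_t wt_even = __SXTB16(wt);            /* weight bytes 0,2 -> q15 */
    uint32_t wt_odd  = __SXTB16(__ROR(wt, 8));  /* weight bytes 1,3 -> q15 */
    acc = (int32_t)__SMLAD(in_even, wt_even, (uint32_t)acc); /* 2 MACs */
    return (int32_t)__SMLAD(in_odd, wt_odd, (uint32_t)acc);  /* 2 more MACs */
}
```

Because the even/odd interleaving is applied identically to both operands, the pairwise products line up regardless of lane order, which is why the explicit reordering step can be dropped.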