The raw output of an Automatic Speech Recognition system usually consists in a stream of words without any casing nor punctuation. In order to improve the readability and enable further uses of this output, punctuation and capitalisation have to be included. In this context, we present AutoPunct, a Transformers-based automatic punctuation and capitalisation model that combines both acoustic (i.e. silences duration) and lexical information (the words themselves). We compared its performance with a system based on Bidirectional Recurrent Neural Networks (BRNN) on Basque (a low-resource language) and Spanish, both individually and simultaneously. The result is a system that achieves high accuracy for punctuation and capitalisation in both languages at the same time, with a throughput of several thousand words per second using a standard GPU.