Transformer models have revolutionized the field of Natural Language Processing (NLP), achieving state-of-the-art performance in applications such as machine translation, question answering, regression, and summarization. However, training Transformers is challenging because of their large memory and compute requirements. The literature contains several approaches to parallelizing training, such as layer parallelism and pipeline parallelism, but these are designed to benefit out-of-core models and do not exploit the inherent parallelism within Transformer models. Other work uses model parallelism to achieve weak scaling by increasing the model size. In this paper, we propose sub-graph parallelism, which provides a significant performance improvement over pure data parallelism on a fixed number of resources and serves as an additional technique for strong and weak scaling without increasing model capacity. Our technique accelerates the training of Transformer models, and we generalize the concept to any neural network with multiple branches. We optimize the communication for sub-graph parallelism and combine it with data parallelism to scale performance up to 1024 GPUs. To reduce communication overheads, we propose a topology-aware scheme that limits inter-node communication. Finally, we empirically compare sub-graph parallelism with pure data parallelism and demonstrate its performance benefits in end-to-end training.
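To illustrate the idea behind sub-graph parallelism, the following is a minimal sketch (not the paper's implementation) written in PyTorch with hypothetical names such as SubGraphAttention. It treats the independent attention heads of a Transformer layer as parallel branches of the computational graph: groups of heads are placed on separate GPUs, evaluated concurrently, and their outputs gathered at the end, falling back to CPU devices when fewer GPUs are available.

```python
# Minimal sketch of sub-graph parallelism for multi-head attention (assumption:
# branches = groups of heads mapped to devices; not the paper's actual code).
import torch
import torch.nn as nn


class SubGraphAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_devices=2):
        super().__init__()
        assert n_heads % n_devices == 0
        self.d_head = d_model // n_heads
        self.heads_per_dev = n_heads // n_devices
        # Map each branch (group of heads) to its own device; fall back to CPU.
        self.devices = [
            torch.device(f"cuda:{i}") if torch.cuda.device_count() > i
            else torch.device("cpu")
            for i in range(n_devices)
        ]
        self.branches = nn.ModuleList()
        for dev in self.devices:
            width = self.heads_per_dev * self.d_head
            self.branches.append(nn.ModuleDict({
                "q": nn.Linear(d_model, width),
                "k": nn.Linear(d_model, width),
                "v": nn.Linear(d_model, width),
            }).to(dev))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq, _ = x.shape
        outputs = []
        for dev, branch in zip(self.devices, self.branches):
            xb = x.to(dev, non_blocking=True)  # scatter the input to the branch
            q = branch["q"](xb).view(batch, seq, self.heads_per_dev, self.d_head).transpose(1, 2)
            k = branch["k"](xb).view(batch, seq, self.heads_per_dev, self.d_head).transpose(1, 2)
            v = branch["v"](xb).view(batch, seq, self.heads_per_dev, self.d_head).transpose(1, 2)
            scores = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
            head_out = (scores @ v).transpose(1, 2).reshape(batch, seq, -1)
            outputs.append(head_out.to(x.device))  # gather branch results
        return self.out(torch.cat(outputs, dim=-1))


x = torch.randn(4, 16, 512)          # (batch, sequence, d_model)
print(SubGraphAttention()(x).shape)  # torch.Size([4, 16, 512])
```

In this sketch the loop over branches runs sequentially; in practice each branch would be launched on its own CUDA stream or process so the sub-graphs execute concurrently, and the final gather is the only communication required before the output projection.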