Deep Learning for Automatic Diacritics Restoration in Romanian

In this paper we address the issue of automatic diacritics restoration (ADR) for Romanian using deep learning strategies.We compare 6 separate architectures with various mixtures of recurrent and convolutional layers. The input consists in sequences of consecutive words stripped of their diacritic symbols. The network’s task is to learn to restore the diacritics for the entire sequence. No additional linguistic or semantic information is used as input to the networks.The best results were obtained with a CNN-based architecture and achieved an accuracy of 97% at word level. At diacritic-level the accuracy of the same architecture is 89%.