SoMaJo: State-of-the-art tokenization for German web and social media texts

In this paper we describe SoMaJo, a rule-based tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task, with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, and we present a detailed error analysis. The tokenizer is available as free software.