EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of ComputerMediated Communication / Social Media”. Our system is based on a rulebased tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer mediated communication (CMC) and general German web data (WEB). We achieve the second rank in three of four scenarios. Also, the presented systems are freely available as open source components.