论文信息 - EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of ComputerMediated Communication / Social Media”. Our system is based on a rulebased tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer mediated communication (CMC) and general German web data (WEB). We achieve the second rank in three of four scenarios. Also, the presented systems are freely available as open source components.

[1] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[2] Wolfgang Lezius,et al. TIGER: Linguistic Interpretation of a German Corpus , 2004 .

[3] Jason Weston,et al. Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[4] Chunyu Kit,et al. Tokenization as the Initial Phase in NLP , 1992, COLING.

[5] Mihai Surdeanu,et al. The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[6] David A. Ferrucci,et al. UIMA: an architectural approach to unstructured information processing in the corporate research environment , 2004, Natural Language Engineering.

[7] Thorsten Brants,et al. TnT – A Statistical Part-of-Speech Tagger , 2000, ANLP.

[8] Christian Biemann,et al. GermaNER: Free Open German Named Entity Recognition Tool , 2015, GSCL.

[9] Bryan Jurish,et al. Word and Sentence Tokenization with Hidden Markov Models , 2013, J. Lang. Technol. Comput. Linguistics.

[10] Christian Biemann,et al. Text: now in 2D! A framework for lexical expansion with contextual similarity , 2013, J. Lang. Model..

[11] Tibor Kiss,et al. Unsupervised Multilingual Sentence Boundary Detection , 2006, CL.

[12] Praveen Paritosh,et al. Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.