The potential of processing user-generated texts freely available on the web is widely recognized, but because the language used on the web is non-canonical, these data cannot be processed with conventional methodologies designed for well-edited formal texts. Procedures for annotating raw web data have been researched far less extensively than those for annotating well-edited texts, and this holds for Turkish language processing as well. Moreover, there is a considerable shortage of human-annotated corpora derived from Turkish web data. The ITU Web Treebank is the first attempt at a diverse corpus compiled from Turkish texts found on the web. In this paper, we first present our survey of the non-canonical aspects of the language used on the Turkish web. Next, we discuss in detail the annotation procedure followed in the ITU Web Treebank, revised for compatibility with the language of the web. Finally, we describe the web-based annotation tool that implements this procedure and on which the treebank was annotated.