Managing a Multilingual Treebank Project

This paper describes the work process of a Multilingual Treebank Annotation Project carried out for Google. The project is coordinated by a small core team that supervises linguists working online from locations across the globe. The linguists review the output of a dependency-syntactic parser, including the part-of-speech (POS) tags, the dependency types and the relations between tokens, correct errors in that output, and prepare the data in a form that can be used for further training of the parser. We focus on the Quality Assurance processes and methodology used to monitor the output of the four language teams engaged in the project. On the quantitative side, we monitor throughput to spot issues in a particular language that would require intervention or a change to the process. This is combined with a qualitative analysis based on snapshots of the data at three stages: the incoming parsed data, the data after the first review round, and the data after the final cross-review. Statistics are compiled from each snapshot and compared across stages. In addition, possible inconsistencies in the annotations are checked and, where possible, corrected automatically at appropriate stages of the process to minimize manual work.
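The snapshot comparison can be illustrated with a minimal sketch. The abstract does not specify the data format or the statistics that are compiled, so everything below is an assumption made for illustration: the `Token` fields, the function name `compare_snapshots`, and the choice of counting changed POS tags, heads and dependency types between two snapshots of the same sentences.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    # Hypothetical token record; the project's real format is not given.
    form: str
    pos: str      # part-of-speech tag
    head: int     # 1-based index of the head token, 0 for the root
    deprel: str   # dependency type


def compare_snapshots(before, after):
    """Count annotation changes between two snapshots of the same sentences.

    Each snapshot is a list of sentences, and each sentence is a list of
    Token records. Sentences whose token count differs are counted
    separately and skipped, since their tokens cannot be aligned by position.
    """
    stats = Counter()
    for sent_before, sent_after in zip(before, after):
        if len(sent_before) != len(sent_after):
            stats["sentences_retokenized"] += 1
            continue
        for tok_b, tok_a in zip(sent_before, sent_after):
            stats["tokens"] += 1
            if tok_b.pos != tok_a.pos:
                stats["pos_changed"] += 1
            if tok_b.head != tok_a.head:
                stats["head_changed"] += 1
            if tok_b.deprel != tok_a.deprel:
                stats["deprel_changed"] += 1
    return stats


# Toy data: the parser tagged "Dogs" as VERB and the reviewer corrected it.
parsed = [[Token("Dogs", "VERB", 2, "nsubj"), Token("bark", "VERB", 0, "root")]]
reviewed = [[Token("Dogs", "NOUN", 2, "nsubj"), Token("bark", "VERB", 0, "root")]]

stats = compare_snapshots(parsed, reviewed)
print(dict(stats))
```

Running the same comparison between the first-round and cross-review snapshots would give a second set of counts, so that the amount of correction at each stage can be compared per language.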