Treebanking in Northern Europe: A White Paper

We present the case for an extensive scientific effort to build up large treebanks for the Nordic and Baltic languages, as a step towards developing advanced multilingual communication technologies for these languages in the future.

Nordic language technology is urgent

Language and speech processing is rapidly becoming a priority area for Northern Europe (in which we include the Baltic states). Two recent, broad developments are causes of this urgency. First, the speed at which digital information technologies have penetrated our society has accelerated particularly strongly in Northern Europe. Secondly, the expansion of the European Union has had an impact on the interaction between these countries, in particular between the Nordic states and the Baltic states. The challenges for multilingual text and speech processing are enormous. The Northern European area comprises about 31.5 million people (not including Northern Russia) speaking 8 official languages (9 including Russian) and several minority languages. Roughly 45% of all EU inhabitants are able to converse in at least two different languages. The language that is most often the common denominator in Europe, also in Northern Europe, is English, although in the Baltic states and Russia, English language competency is lower, for historical reasons. From a political viewpoint, it is not acceptable to promote English at the expense of the languages of the region. Most of the information on the Internet consists of natural language, of which less than 50% is currently in English, compared to nearly 100% just ten years ago. Consequently, there is a tremendous and increasing need for language processing tools that make this information accessible to users of different languages. Among these linguistic tools we mention information search and retrieval; filtering, indexing and classification; summarization; translation; text-to-speech and dictation, etc.
Recent advances in computational linguistic research are making the development of efficient tools feasible, but it is important to remember that these tools cannot be made independent of the particular language to be treated. English language tools will simply not work for Finnish.

Advanced Nordic language resources are needed

The development of language-specific tools typically requires research on very extensive language resources. These comprise large text and speech collections commonly called corpora. Adequately coded and quality-controlled corpora provide the empirical basis for nearly every stage in research and development of language technology products: (i) compiling linguistic requirements and specifications for new systems, (ii) extracting linguistic knowledge in the form of word lists, grammar rules, etc., (iii) repeated testing of research-stage prototypes against real language data, and (iv) final evaluation of systems and applications. Not only must corpora be very large in order to be representative, they must also be carefully encoded and enriched with linguistic descriptions. Given the massive ambiguity in natural language, raw text corpora are of limited use. To take a very simple example, in the Swedish sentence (1) it is impossible, based on word order alone, to determine what is the subject and what is the object of
