Standardized Multilingual Language Resources for the Web of Data

Statistical knowledge about natural languages is indispensable for many services that require Natural Language Processing (NLP) functionality, such as information retrieval. The NLP Group at the University of Leipzig began providing such statistical information for more than 50 languages in the Leipzig Corpora Collection (LCC) [1] more than a decade ago. Some of its corpora contain more than 5 million words and more than 300 million links between them, resulting in an accumulated size of about 60 million words and 814 million links across all corpora. Until now, this valuable information could be accessed through a human-readable Web site and a SOAP Web service, and excerpts of the data could be downloaded as SQL data dumps. A linked data interface for the LCC has now become desirable in order to allow a wider range of applications to make use of the corpora. This report presents the LCC linked data interface. The new service provides information about almost 60 million resources in approximately 900 million triples. Additionally, it offers links to other vocabularies such as WordNet [2] and to DBpedia [3]. The service is realized using a customized version of D2R Server [4].