Paper information - Sparv: Språkbanken’s corpus annotation pipeline infrastructure

Sparv is Språkbanken’s corpus annotation pipeline infrastructure. The easiest way to use the pipeline is through its web interface, with a plain text document as input. The pipeline applies in-house and external tools to the text to segment it into sentences and paragraphs, tokenise it, tag parts of speech, look up words in dictionaries and analyse compounds. The pipeline can also be run through a web API that returns XML results, and it is run locally at Språkbanken to prepare the documents in Korp, our corpus search tool. While the most sophisticated support is for modern Swedish, the pipeline supports 15 languages.
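
The first stages described above (paragraph segmentation, sentence segmentation, tokenisation, XML output) can be illustrated with a toy sketch. This is not Sparv's implementation or its API: the function names, the naive regular-expression rules, and the XML element names (`text`, `paragraph`, `sentence`, `w`) are all assumptions made for illustration, and the later stages (part-of-speech tagging, dictionary lookup, compound analysis) are omitted.

```python
import re
import xml.etree.ElementTree as ET


def split_paragraphs(text):
    """Split plain text into paragraphs at blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive rule (an assumption): a sentence ends at . ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def tokenise(sentence):
    """Return word tokens and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(text):
    """Build a nested XML tree: text > paragraph > sentence > w (hypothetical names)."""
    root = ET.Element("text")
    for para in split_paragraphs(text):
        p_el = ET.SubElement(root, "paragraph")
        for sent in split_sentences(para):
            s_el = ET.SubElement(p_el, "sentence")
            for tok in tokenise(sent):
                ET.SubElement(s_el, "w").text = tok
    return root


sample = "Det här är en mening. Här är en till!\n\nNytt stycke."
doc = annotate(sample)
print(ET.tostring(doc, encoding="unicode"))
```

A real pipeline would replace each naive rule with a trained or rule-based tool and attach further annotations, such as part-of-speech tags, as attributes on the token elements.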
