Line-a-line: A Tool for Annotating Word-Alignments

We describe line-a-line, a web-based tool for manual annotation of word alignments in sentence-aligned parallel corpora. The graphical user interface, which builds on a design template from the Jigsaw system for investigative analysis, displays the words of each sentence pair to be annotated as elements in two vertical lists. An alignment between two words is annotated by drag-and-drop, i.e. by dragging an element from the left-hand list and dropping it on an element in the right-hand list. The tool shows that two words are aligned by drawing a line between them and by highlighting the associated words when the mouse pointer hovers over either of them. Line-a-line uses the efmaral library to produce pre-annotated alignments, on which the user can base the manual annotation. The tool is mainly intended for moderately under-resourced languages, for which parallel corpora are scarce. The automatic word-alignment functionality therefore also incorporates information derived from non-parallel resources, in the form of pre-trained multilingual word embeddings from the MUSE library.
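To make the annotation model concrete, the following is a minimal sketch of how one sentence pair and its alignment links could be represented. It is not taken from line-a-line: the class `SentencePair` and its methods are hypothetical names introduced for illustration, and the serialization into `i-j` index pairs (often called the Pharaoh format) is an assumption about a convenient export format, not a description of what the tool stores.

```python
from dataclasses import dataclass, field


@dataclass
class SentencePair:
    """One sentence pair plus its alignment links (hypothetical structure)."""

    source: list[str]  # words shown in the left-hand list
    target: list[str]  # words shown in the right-hand list
    links: set[tuple[int, int]] = field(default_factory=set)

    def align(self, src_idx: int, tgt_idx: int) -> None:
        """Record a link, as dropping a left-hand word on a right-hand word would."""
        if not (0 <= src_idx < len(self.source) and 0 <= tgt_idx < len(self.target)):
            raise IndexError("token index out of range")
        self.links.add((src_idx, tgt_idx))

    def unalign(self, src_idx: int, tgt_idx: int) -> None:
        """Remove a link if it exists; do nothing otherwise."""
        self.links.discard((src_idx, tgt_idx))

    def linked_targets(self, src_idx: int) -> list[str]:
        """Target words linked to a source word, i.e. what hovering would highlight."""
        return [self.target[j] for i, j in sorted(self.links) if i == src_idx]

    def to_pharaoh(self) -> str:
        """Serialize the links as space-separated 'i-j' index pairs."""
        return " ".join(f"{i}-{j}" for i, j in sorted(self.links))


# Illustrative data only; the sentence pair is made up for this sketch.
pair = SentencePair(
    source=["das", "Haus", "ist", "klein"],
    target=["the", "house", "is", "small"],
)
pair.align(0, 0)
pair.align(1, 1)
pair.align(3, 3)
```

In this sketch a pre-annotated alignment from an automatic aligner would simply populate `links` before the annotator starts, and each manual correction becomes a call to `align` or `unalign`.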
