WAGS (Word Alignment Gold Standard) is a novel benchmark that supports extensive evaluation of word alignment (WA) tools on out-of-vocabulary (OOV) and rare words.
WAGS is a subset of the Common Test section of the Europarl English-Italian parallel corpus, specifically tailored to OOV and rare words. It consists of 6,715 sentence pairs containing 11,958 occurrences of OOV and rare words (5,080 English words and 6,878 Italian words), where a rare word is one that occurs at most 15 times in the Europarl Training set. These occurrences represent almost 3% of the whole text.
Since WAGS focuses on OOV and rare words, manual alignments are provided for these words only, not for whole sentences.
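Because gold links exist only for the OOV/rare words, an aligner's output presumably has to be restricted to links that touch those words before it is scored. The following is a minimal sketch of that idea, not the official WAGS scoring procedure: the link representation (pairs of token indices), the helper names `restrict_to_focus` and `precision_recall_f1`, and the choice of precision/recall/F1 as metrics are all assumptions made for illustration, and the WAGS file format is not modeled here.

```python
from typing import Set, Tuple

# Assumed representation: a link is (english_token_index, italian_token_index).
Link = Tuple[int, int]


def restrict_to_focus(links: Set[Link],
                      focus_src: Set[int],
                      focus_tgt: Set[int]) -> Set[Link]:
    """Keep only the links that involve an annotated OOV/rare token."""
    return {(s, t) for (s, t) in links if s in focus_src or t in focus_tgt}


def precision_recall_f1(gold: Set[Link],
                        hypothesis: Set[Link]) -> Tuple[float, float, float]:
    """Score hypothesis links against gold links (hypothetical metric choice)."""
    correct = len(gold & hypothesis)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy sentence pair: the English token at index 2 is the only rare word.
gold = {(2, 3)}
hypothesis = {(0, 0), (1, 1), (2, 3), (2, 4)}
focus_src = {2}          # indices of annotated English tokens
focus_tgt = set()        # indices of annotated Italian tokens (none here)

restricted = restrict_to_focus(hypothesis, focus_src, focus_tgt)
precision, recall, f1 = precision_recall_f1(gold, restricted)
print(sorted(restricted), precision, recall, round(f1, 3))
```

In this toy case the links (0, 0) and (1, 1) are ignored because no gold annotation exists for those tokens, so they can be counted neither as correct nor as wrong.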
The dataset is released under a Creative Commons Attribution 4.0 International License.
Publications or presentations containing results obtained through the use of WAGS should cite the following reference:
L. Bentivogli, M. Cettolo, M. A. Farajian, M. Federico. 2016.
“WAGS: A Beautiful English-Italian Benchmark Supporting Word Alignment Evaluation on Rare Words”. In Proceedings of LREC 2016.
The creation of WAGS was supported by the ModernMT project, financed by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645487.