The BitterCorpus is a collection of parallel English-Italian documents in the IT domain in which domain-specific terms have been manually marked and aligned. The documents are extracted from the GNOME and KDE data collections. Together they contain 874 domain-specific bilingual terms.
The GNOME portion contains 55 parallel documents extracted from the GNOME manual documentation (IT domain). Three annotators, fluent in English and Italian, were selected to annotate the documents with domain-specific terms. In total, they annotated 313 Italian terms, 282 English terms, and 237 bilingual domain-specific terms.
The KDE portion contains one parallel document extracted from the KDE manual documentation (IT domain); the document consists of 100 lines of text. Three annotators, fluent in English and Italian, were selected to annotate the document with domain-specific terms. In total, they annotated 628 Italian terms, 628 English terms, and 637 bilingual domain-specific terms.
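As a quick sanity check, the per-collection figures above can be tallied against the overall total of 874 bilingual terms. The sketch below only restates the numbers given in this description. The dictionary layout and the `total` helper are illustrative assumptions and do not reflect the corpus's actual file format.

```python
# Term counts copied from the corpus description (not read from the corpus files).
counts = {
    "GNOME": {"documents": 55, "italian": 313, "english": 282, "bilingual": 237},
    "KDE": {"documents": 1, "italian": 628, "english": 628, "bilingual": 637},
}


def total(field):
    """Sum one field (hypothetical helper) across both collections."""
    return sum(part[field] for part in counts.values())


total_bilingual = total("bilingual")
print(total_bilingual)  # 874, matching the overall figure stated above
```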
BitterCorpus is freely available for research purposes and is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license.
The data were used for the SMT evaluation presented in:
Mihael Arcan, Marco Turchi, Sara Tonelli and Paul Buitelaar. “Enhancing Statistical Machine Translation with Bilingual Terminology in a CAT Environment”. In Proceedings of AMTA 2014.
If you use the corpus, please cite the above paper.
Contact us: turchi[at]fbk.eu