MuST-C gender-balanced validation set

This is a new MuST-C validation set specifically designed for use when training speech translation (ST) systems for experiments on gender translation.

While the standard MuST-C validation set reflects the same gender-imbalanced distribution found in the MuST-C training data (30% female vs. 70% male speakers), this new validation set contains 20 TED talks balanced according to speakers’ gender, so that model selection does not reward potentially biased behaviour.

The procedure followed to select talks according to speakers’ gender information is the same as that used to create MuST-Speakers, in which all 2,545 TED talks included in MuST-C V1.2 were manually labelled on the basis of the personal pronouns found in the speakers’ publicly available personal TED section.

We remark that, by relying on speakers’ pronouns in our talk selection, we do not make any assumption about the speakers’ gender identity. Rather, we exclusively account for the gendered linguistic forms by which the speakers accept being referred to in English, and to which they most likely want their translation to conform.
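
As a rough illustration of what a gender-balanced selection looks like, the sketch below draws the same number of talks for each pronoun label from a toy annotation list. The talk IDs, the label values ("She" and "He"), the function name and the use of seeded random sampling are all assumptions made for this example; it is not the procedure actually used to build the validation set.

```python
import random
from collections import defaultdict


def select_balanced(talks, per_label, labels=("She", "He"), seed=0):
    """Pick `per_label` talks for each pronoun label.

    `talks` is an iterable of (talk_id, label) pairs. Both the pair layout
    and the label values are hypothetical, chosen only for this sketch.
    """
    by_label = defaultdict(list)
    for talk_id, label in talks:
        by_label[label].append(talk_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = []
    for label in labels:
        pool = sorted(by_label[label])
        if len(pool) < per_label:
            raise ValueError(f"not enough talks labelled {label!r}")
        selected.extend((talk_id, label) for talk_id in rng.sample(pool, per_label))
    return selected


# Toy annotation list with a 30% / 70% imbalance, mirroring the training data.
toy_talks = [(f"talk_{i}", "She" if i % 10 < 3 else "He") for i in range(100)]

# 10 talks per label gives 20 talks in total.
balanced = select_balanced(toy_talks, per_label=10)
```

Sampling per label, rather than from the pooled list, is what keeps the result balanced even when the underlying annotation is skewed.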

The gender-balanced validation set was conceived to develop models to be evaluated on MuST-SHE, a benchmark derived from MuST-C which allows for a fine-grained analysis of gender bias in Machine Translation and Speech Translation. For this reason, it covers the same language directions as MuST-SHE, namely:

  • English-French
  • English-Italian
  • English-Spanish

 

How to obtain the MuST-C gender-balanced validation set

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

The MuST-C gender-balanced validation set is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0 International) license, and is freely downloadable.
 


 

Reference paper

If you use this validation set in your work, please cite the following paper:

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
Breeding Gender-Aware Direct Speech Translation Systems
In Proceedings of the 28th International Conference on Computational Linguistics (COLING’2020), December 8-13, 2020, Online, pp. 3951-3964.

Bibtex

@inproceedings{gaido-etal-2020-breeding,
title = "Breeding Gender-aware Direct Speech Translation Systems",
author = "Gaido, Marco and Savoldi, Beatrice and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.350",
pages = "3951--3964"
}

Related resources for research on gender translation

  • MuST-Speakers: annotation of MuST-C talks with speakers’ gender information.
  • MuST-SHE: a benchmark derived from MuST-C which allows for a fine-grained analysis of gender bias in Machine Translation and Speech Translation.
  • Code to generate the ST systems presented in the COLING’2020 paper.