MuST-SHE

MuST-SHE is a multilingual benchmark allowing for a fine-grained analysis of gender bias in Machine Translation and Speech Translation.

MuST-SHE is a subset of the TED-based MuST-C corpus and is available for English-French, English-Italian and English-Spanish. The dataset is composed of (audio, transcript, translation) triplets annotated with qualitatively differentiated and balanced gender-related phenomena. Each triplet requires the translation of at least one English gender-neutral word into the corresponding masculine/feminine target word(s). MuST-SHE comprises 3,367 triplets (1,164 for En-Es, 1,108 for En-Fr, and 1,095 for En-It) uttered by 295 different speakers. Also, a common subset of 1,040 instances allows for comparative evaluations of gender translation across the three language directions.

The DATA STATEMENT for MuST-SHE v1.2 is available HERE.

 

How to obtain MuST-SHE

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

MuST-SHE is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0 International) license, and is freely downloadable.

 

Old Versions
  • Version 1.0 (superseded by v1.1 on 11/2/2021)
  • Version 1.1 (superseded by v1.2 on 28/7/2021) – Data Statement available here.

 

MuST-SHE v1.2 EXTENSIONS

MuST-SHE extensions consist of two manually created linguistic annotation layers, which enrich the multilingual benchmark with information about Parts-of-Speech and gender agreement chains.

Available for the three language pairs represented in MuST-SHE (en-es/fr/it), these annotations allow for fine-grained analyses on gender bias and translation that (i) are disaggregated by POS, thus exhibiting the extent to which different lexical categories are impacted by gender bias, and (ii) account for the morphosyntactic behaviour of gender agreement beyond the word level.

The MuST-SHE v1.2 EXTENSIONS release includes both the linguistic annotations and the evaluation scripts to compute gender translation accuracy focused on the two dimensions of POS and agreement.
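As a rough illustration of what a POS-disaggregated analysis involves, the sketch below computes gender accuracy per part-of-speech and checks whether an agreement chain is resolved consistently. It is not the evaluation script shipped with the release: the `AnnotatedWord` record, its fields, the POS labels, and both function names are hypothetical and chosen only for this example, and the chain check is a simplification that treats a chain as consistent when all of its words are either correct or incorrect.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AnnotatedWord:
    """One gender-marked target word (hypothetical record layout)."""
    pos: str       # part-of-speech label; the label set here is illustrative
    correct: bool  # True if the system output carries the expected gender


def accuracy_by_pos(words):
    """Return {pos: fraction of words generated with the expected gender}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for word in words:
        totals[word.pos] += 1
        hits[word.pos] += int(word.correct)
    return {pos: hits[pos] / totals[pos] for pos in totals}


def chain_is_consistent(chain):
    """True when all words in an agreement chain share the same outcome."""
    return len({word.correct for word in chain}) <= 1


# Toy input, invented for the example (not taken from MuST-SHE).
sample = [
    AnnotatedWord("noun", True),
    AnnotatedWord("noun", False),
    AnnotatedWord("adjective", True),
    AnnotatedWord("verb", True),
]

scores = accuracy_by_pos(sample)
for pos, score in sorted(scores.items()):
    print(f"{pos}: {score:.2f}")
```

Reporting one score per POS, rather than a single aggregate, is what makes it possible to see which lexical categories are most affected; for actual results, the scripts included in the release should be used.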

MuST-SHE v1.2 EXTENSIONS are released under the same license as MuST-SHE.

 

Reference papers

  • If you use MuST-SHE in your work, please cite the following paper:

Luisa Bentivogli, Beatrice Savoldi, Matteo Negri, Mattia Antonino Di Gangi, Roldano Cattoni, Marco Turchi.
“Gender in danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus”.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 6923–6933, Online, July 2020.

  • If you use MuST-SHE v1.2 extensions, please cite the following paper:

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Under the Morphosyntactic Lens: A Multifaceted Evaluation of Gender Bias in Speech Translation”.
To appear in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Dublin, Ireland, 22-27 May 2022.

Related resources for research on gender translation

These resources are presented in the following paper, which was nominated as an outstanding paper at COLING’2020:

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Breeding Gender-Aware Direct Speech Translation Systems”.
In Proceedings of the 28th International Conference on Computational Linguistics (COLING’2020), December 8-13, 2020, Online, pp. 3951-3964.

 

Other papers on gender translation

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation”.
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, August 2-4, Online, pp. 3576-3589.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Gender Bias in Machine Translation”.
In Transactions of the Association for Computational Linguistics (TACL), 2021, Vol. 9, pp. 845–874. MIT Press Direct.