MuST-SHE is a multilingual benchmark allowing for a fine-grained analysis of gender bias in Machine Translation and Speech Translation.
MuST-SHE is a subset of the TED-based MuST-C corpus and is available for English-French, English-Italian and English-Spanish. The dataset is composed of (audio, transcript, translation) triplets annotated with qualitatively differentiated and balanced gender-related phenomena. Each triplet requires the translation of at least one English gender-neutral word into the corresponding masculine/feminine target word(s). MuST-SHE comprises 3,367 triplets (1,164 for En-Es, 1,108 for En-Fr, and 1,095 for En-It) uttered by 295 different speakers. Also, a common subset of 1,040 instances allows for comparative evaluations of gender translation across the three language directions.
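As a quick consistency check on the figures above, the following minimal sketch sums the per-direction triplet counts and computes how much of each direction is covered by the common subset. The counts are taken from the description; the names `TRIPLETS`, `COMMON_SUBSET`, `total_triplets` and `common_share` are hypothetical and are not part of the MuST-SHE release.

```python
# Triplet counts per language direction, as stated in the description above.
TRIPLETS = {"en-es": 1164, "en-fr": 1108, "en-it": 1095}

# Size of the subset shared by all three language directions.
COMMON_SUBSET = 1040


def total_triplets(counts):
    """Return the total number of triplets across all language directions."""
    return sum(counts.values())


def common_share(counts, common):
    """Return, per direction, the fraction of triplets in the common subset."""
    return {pair: common / n for pair, n in counts.items()}


if __name__ == "__main__":
    print(total_triplets(TRIPLETS))  # 3367, matching the reported total
    for pair, share in common_share(TRIPLETS, COMMON_SUBSET).items():
        print(f"{pair}: {share:.1%}")
```

The per-direction shares show how much of each language pair can be used for the cross-lingual comparison mentioned above.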
Data Statements for MuST-SHE are also available.
How to obtain MuST-SHE
TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.
MuST-SHE is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license and is freely downloadable.
If you use MuST-SHE in your work, please cite the following paper:
Luisa Bentivogli, Beatrice Savoldi, Matteo Negri, Mattia Antonino Di Gangi, Roldano Cattoni, Marco Turchi.
“Gender in danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus”.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 6923–6933, Online, July 2020.
Related resources for research on gender translation
- MuST-Speakers: Annotation of MuST-C talks with speakers’ gender information
- MuST-C Gender-balanced Validation Set: a new MuST-C validation set specifically designed to train ST systems for experiments on gender translation.
These resources are presented in the following paper, which was nominated as an outstanding paper at COLING’2020:
Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Breeding Gender-Aware Direct Speech Translation Systems”.
In Proceedings of the 28th International Conference on Computational Linguistics (COLING’2020), pages 3951–3964, Online, December 8-13, 2020.