MuST-SHE

MuST-SHE is a multilingual benchmark allowing for a fine-grained analysis of gender bias in Machine Translation and Speech Translation.

MuST-SHE is a subset of the TED-based MuST-C corpus and is available for English-French, English-Italian and English-Spanish. The dataset is composed of (audio, transcript, translation) triplets annotated with qualitatively differentiated and balanced gender-related phenomena. Each triplet requires the translation of at least one English gender-neutral word into the corresponding masculine/feminine target word(s). MuST-SHE comprises 3,367 triplets (1,164 for En-Es, 1,108 for En-Fr, and 1,095 for En-It) uttered by 295 different speakers. Also, a common subset of 1,040 instances allows for comparative evaluations of gender translation across the three language directions.
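As a quick sanity check on the figures above, the following minimal sketch models one triplet and verifies that the per-language counts add up to the stated total. The `Triplet` class, its field names, the example sentence, and the helper `common_subset_share` are all hypothetical and chosen for illustration; they do not reflect the actual file layout or contents of the MuST-SHE release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one (audio, transcript, translation) item."""
    audio_path: str      # assumed: path to the audio segment
    transcript: str      # English source text
    translation: str     # reference translation in the target language
    language_pair: str   # "en-es", "en-fr" or "en-it"


# Figures quoted in the description above.
TRIPLETS_PER_PAIR = {"en-es": 1164, "en-fr": 1108, "en-it": 1095}
STATED_TOTAL = 3367
COMMON_SUBSET = 1040


def common_subset_share(counts: dict, common: int) -> dict:
    """Fraction of each language pair's triplets covered by the common subset."""
    return {pair: common / n for pair, n in counts.items()}


# Made-up example (not taken from the corpus): the gender-neutral English
# word "friend" must be rendered with a gendered Spanish form.
example = Triplet(
    audio_path="example.wav",
    transcript="She is my friend.",
    translation="Ella es mi amiga.",
    language_pair="en-es",
)

total = sum(TRIPLETS_PER_PAIR.values())
shares = common_subset_share(TRIPLETS_PER_PAIR, COMMON_SUBSET)

print(f"total triplets: {total} (stated: {STATED_TOTAL})")
for pair, share in shares.items():
    print(f"{pair}: common subset covers {share:.1%} of the triplets")
```

The shares show that the common subset covers most of each language direction, which is what makes the cross-language comparison mentioned above practical.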

MuST-SHE Data Statements are also available.

How to obtain MuST-SHE

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

MuST-SHE is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license and is freely downloadable.

Reference paper

If you use MuST-SHE in your work, please cite the following paper:

Luisa Bentivogli, Beatrice Savoldi, Matteo Negri, Mattia Antonino Di Gangi, Roldano Cattoni, Marco Turchi.
“Gender in danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus”.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pages 6923–6933, Online, July 2020.

Related resources for research on gender translation

These resources are presented in the following paper, which received an outstanding paper nomination at COLING 2020:

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Breeding Gender-Aware Direct Speech Translation Systems”.
In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), pages 3951–3964, Online, December 8-13, 2020.