MuST-SHE

MuST-SHE is a multilingual benchmark allowing for a fine-grained analysis of gender bias in Machine Translation and Speech Translation.

MuST-SHE is a subset of the TED-based MuST-C corpus and is available for English-French, English-Italian and English-Spanish. The dataset is composed of (audio, transcript, translation) triplets annotated with qualitatively differentiated and balanced gender-related phenomena. Each triplet requires the translation of at least one English gender-neutral word into the corresponding masculine/feminine target word(s). MuST-SHE comprises 3,367 triplets (1,164 for En-Es, 1,108 for En-Fr, and 1,095 for En-It) uttered by 295 different speakers. A common subset of 1,040 instances allows comparative evaluation of gender translation across the three language directions.
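As a rough illustration of the structure described above, the following Python sketch models one (audio, transcript, translation) triplet and checks that the stated per-direction counts add up to the stated total. The class name, field names and function name are hypothetical and chosen only for illustration; they are not the schema of the released files.

```python
from dataclasses import dataclass

# Triplet counts per language direction, as stated in the description above.
TRIPLETS_PER_DIRECTION = {"en-es": 1164, "en-fr": 1108, "en-it": 1095}
TOTAL_TRIPLETS = 3367
COMMON_SUBSET = 1040


@dataclass(frozen=True)
class Triplet:
    """One benchmark entry. Field names are hypothetical, not the release schema."""
    audio_path: str   # location of the audio segment (assumed to be a file path)
    transcript: str   # English transcript
    translation: str  # reference translation in the target language
    direction: str    # one of the keys of TRIPLETS_PER_DIRECTION


def check_counts() -> bool:
    """Return True if the per-direction counts sum to the stated total and
    the common subset is no larger than the smallest direction."""
    total_ok = sum(TRIPLETS_PER_DIRECTION.values()) == TOTAL_TRIPLETS
    subset_ok = COMMON_SUBSET <= min(TRIPLETS_PER_DIRECTION.values())
    return total_ok and subset_ok
```

Calling `check_counts()` is only a consistency check on the figures quoted in this description; it does not read or validate the dataset itself.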

Data Statements for MuST-SHE v1.1 are available. New Statements for v1.2 are coming soon.

 

How to obtain MuST-SHE

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

MuST-SHE is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license, and is freely downloadable.

 


Reference paper

If you use MuST-SHE in your work, please cite the following paper:

Luisa Bentivogli, Beatrice Savoldi, Matteo Negri, Mattia Antonino Di Gangi, Roldano Cattoni, Marco Turchi.
“Gender in danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus”.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pp. 6923–6933, Online, July 2020.

 

Related resources for research on gender translation

These resources are presented in the following paper, which was nominated as an outstanding paper at COLING 2020:

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Breeding Gender-Aware Direct Speech Translation Systems”.
In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), pp. 3951–3964, Online, December 8-13, 2020.

 

Other papers on gender translation

Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation”.
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3576–3589, Online, August 2-4, 2021.

Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Matteo Negri and Marco Turchi.
“Gender Bias in Machine Translation”.
In Transactions of the Association for Computational Linguistics (TACL), Vol. 9, pp. 845–874, 2021. MIT Press Direct.