MuST-C

MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.

The latest releases of the corpus are:

  • release v1.2:
    14 language directions:
    English-to-{Arabic, Chinese, Czech, Dutch, French, German, Italian, Persian, Portuguese, Romanian, Russian, Spanish, Turkish, Vietnamese}
    (includes the 8 language directions of release v1.0)
  • release v1.1 (special release for IWSLT-2019):
    English-to-Czech language direction (TXT-only)
  • release v1.0:
    8 language directions:
    English-to-{Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish}
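
The release list above can also be written down as plain data, for example to check which target languages a given release covers. The following is a minimal Python sketch; the RELEASES dictionary and the directions_added helper are illustrative names chosen here, not part of any MuST-C tooling or file layout.

```python
# Target languages per release, transcribed from the release list above.
# The source language is always English. Names are illustrative only.
RELEASES = {
    "v1.0": {
        "Dutch", "French", "German", "Italian",
        "Portuguese", "Romanian", "Russian", "Spanish",
    },
    "v1.1": {"Czech"},  # special text-only release for IWSLT-2019
    "v1.2": {
        "Arabic", "Chinese", "Czech", "Dutch", "French", "German",
        "Italian", "Persian", "Portuguese", "Romanian", "Russian",
        "Spanish", "Turkish", "Vietnamese",
    },
}


def directions_added(newer: str, older: str) -> list[str]:
    """Return the target languages in `newer` that are absent from `older`, sorted."""
    return sorted(RELEASES[newer] - RELEASES[older])


if __name__ == "__main__":
    # v1.2 is stated to include all 8 directions of v1.0.
    assert RELEASES["v1.0"] <= RELEASES["v1.2"]
    print(directions_added("v1.2", "v1.0"))
```

Using sets keeps the "includes the directions of v1.0" statement checkable with a single subset comparison.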

MuST-C is continuously growing in size and language coverage. Stay tuned for updates!

How to obtain MuST-C

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

MuST-C is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

Reference paper

If you use MuST-C in your work, please cite the following paper:

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, Marco Turchi. 2020.
“MuST-C: A multilingual corpus for end-to-end speech translation”.
In Computer Speech & Language Journal.
DOI: https://doi.org/10.1016/j.csl.2020.101155