MuST-Cinema

MuST-Cinema is a Multilingual Speech-to-Subtitles corpus designed for building subtitle-oriented machine and speech translation systems.
It comprises audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.

MuST-Cinema was built by annotating MuST-C with subtitle breaks based on the original subtitle files. Special symbols have been inserted in the aligned sentences to mark subtitle breaks as follows:

  • <eob>: block break (break between subtitle blocks)
  • <eol>: line break (break between lines inside the same subtitle block)
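
As a rough illustration of how these markers can be consumed, the sketch below splits one annotated sentence into subtitle blocks and lines. It assumes the markers appear inline in the sentence text exactly as <eob> and <eol>; the function name split_subtitles and the example sentence are hypothetical and not taken from the corpus.

```python
from typing import List

EOB = "<eob>"  # marks the end of a subtitle block
EOL = "<eol>"  # marks a line break inside a subtitle block


def split_subtitles(sentence: str) -> List[List[str]]:
    """Split an annotated sentence into subtitle blocks, each a list of lines."""
    blocks = []
    for raw_block in sentence.split(EOB):
        # Split the block into lines and drop surrounding whitespace.
        lines = [line.strip() for line in raw_block.split(EOL)]
        # Discard empty fragments, e.g. the one after a trailing <eob>.
        lines = [line for line in lines if line]
        if lines:
            blocks.append(lines)
    return blocks


# Hypothetical example sentence (an assumption, not corpus data).
example = (
    "This is the first line <eol> and this is the second. <eob> "
    "A new block starts here. <eob>"
)
for block in split_subtitles(example):
    print(block)
```

Each inner list corresponds to one subtitle block, so joining its items with newlines gives the on-screen layout of that block.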

Release v1.0: 7 language pairs.

English-to-{Dutch, French, German, Italian, Portuguese, Romanian, Spanish}

 

How to obtain MuST-Cinema

TED talks are copyrighted by TED Conference LLC and licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 license.

MuST-Cinema is released under the same Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

 

Reference paper

If you use MuST-Cinema in your work, please cite the following paper:

Alina Karakanta, Matteo Negri and Marco Turchi.
“MuST-Cinema: a Speech-to-Subtitles Corpus”
In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May 13-15, 2020.

 

bibtex:

@InProceedings{mustcinema20,
author = "Karakanta, Alina and Negri, Matteo and Turchi, Marco",
title = "{MuST-Cinema: a Speech-to-Subtitles Corpus}",
booktitle = "Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)",
year = "2020",
address = "Marseille, France",
month = "May 13-15"}