TOSCA-MP Speech Ground Truth
This multilingual dataset was created within the TOSCA-MP project as ground truth data for the evaluation of automatic transcription and spoken language translation technologies. The dataset includes two video genres (television broadcast news and talk shows) and covers four languages: Flemish, English, German, and Italian.
Besides segmentation, turn and speaker identification, and orthographic transcription, the audio signal was richly annotated at both the linguistic level (overlapped speech and foreign speech) and the acoustic level (e.g. background noise, applause, coughing, and music such as songs and jingles).
Orthographic transcriptions were generated by non-expert workers through crowdsourcing and revised by expert transcribers. Rich annotation was carried out by expert transcribers only.
Annotated and transcribed videos:
- Flemish: 5h:51m (news), 6h:13m (talk shows)
- English: 5h:07m (news only)
- German: 4h:03m (news), 5h:02m (talk shows)
- Italian: 3h:54m (news), 7h:21m (talk shows)
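
The per-language figures above can be totalled per genre and overall. The following is a minimal sketch rather than part of the dataset: the durations are copied by hand from the list, and the names `DURATIONS`, `total_minutes`, and `fmt` are hypothetical helpers introduced only for illustration.

```python
# Durations copied from the list above, as (hours, minutes) per language and genre.
# This table is an illustrative transcription, not a file shipped with the dataset.
DURATIONS = {
    "Flemish": {"news": (5, 51), "talk shows": (6, 13)},
    "English": {"news": (5, 7)},
    "German": {"news": (4, 3), "talk shows": (5, 2)},
    "Italian": {"news": (3, 54), "talk shows": (7, 21)},
}


def total_minutes(genre=None):
    """Sum durations in minutes, optionally restricted to one genre."""
    return sum(
        hours * 60 + minutes
        for genres in DURATIONS.values()
        for name, (hours, minutes) in genres.items()
        if genre is None or name == genre
    )


def fmt(minutes):
    """Format a minute count in the same h:mm style used in the list."""
    return f"{minutes // 60}h:{minutes % 60:02d}m"


print("news:      ", fmt(total_minutes("news")))        # 18h:55m
print("talk shows:", fmt(total_minutes("talk shows")))  # 18h:36m
print("all:       ", fmt(total_minutes()))              # 37h:31m
```

By this count the annotated material amounts to roughly 37.5 hours, split almost evenly between the two genres.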
Furthermore, a subset of the broadcast news data (around two hours, corresponding to about 20,000 words) was translated by professional translators in the following directions:
- Flemish to English
- English to Italian
- German to English
- German to Italian
The TOSCA-MP Speech Ground Truth is distributed under a Creative Commons Attribution 4.0 International license (CC BY 4.0). Due to copyright restrictions, only the generated ground truth is distributed here; the corresponding videos are available through links provided in the ground truth documentation.