End-to-end Spoken Language Translation in Rich Data Conditions

What:

Deep learning techniques are significantly influencing research in most NLP tasks, outperforming traditional approaches and suggesting entirely new problems and solutions. Recently, end-to-end models have also shown promising results in Spoken Language Translation (SLT), which targets the translation of a speech signal in one language into text in another language. The task combines the main challenges of its two parent research areas: machine translation (MT) and automatic speech recognition (ASR). From MT, it inherits the challenge of translating into a language different from that of the input. This raises problems related to handling word reordering (especially between languages with different syntax) and long-range dependencies between words, as well as resolving ambiguities and preserving the style and register of the input. From ASR, it inherits the challenge of processing unsegmented audio signals, which are typically represented as very long sequences of filterbank features that jointly encode information along the time and frequency dimensions. Traditional approaches to SLT are based on cascaded architectures that concatenate ASR and MT components. A major advantage of the cascade solution is that it builds on consolidated technology and on the large amounts of data available in the two fields. Its drawbacks include the engineering effort required to train and maintain separate components, and the need to make them robust to error propagation: recognition errors in the ASR output are passed on to the MT step. Direct, end-to-end approaches aim to overcome these limitations by exploiting the power of deep neural networks to map the input audio directly into target-language text.
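The contrast between the two architectures can be sketched as function composition. The snippet below is a purely illustrative toy, not an actual model: `asr`, `mt`, and `direct_slt` are hypothetical stand-ins that show where error propagation enters a cascade and why a direct model avoids it.

```python
def asr(audio_features):
    """Stand-in ASR: maps audio features to a source-language transcript.
    A real system would return a (possibly erroneous) hypothesis."""
    return "hello world"

def mt(transcript):
    """Stand-in MT: maps a source-language transcript to target text.
    An unseen (e.g. mis-recognized) input degrades the translation."""
    return {"hello world": "hallo welt"}.get(transcript, transcript)

def cascade_slt(audio_features):
    # Cascade: the ASR hypothesis is fed to MT, so any recognition
    # error propagates into the translation step.
    return mt(asr(audio_features))

def direct_slt(audio_features):
    # Direct/end-to-end: a single model maps audio straight to
    # target-language text, with no intermediate transcript.
    return "hallo welt"

audio = [0.1, 0.2, 0.3]  # dummy stand-in for filterbank frames
print(cascade_slt(audio))  # hallo welt
print(direct_slt(audio))   # hallo welt
```

The two return the same output here only because the toy ASR is perfect; the point of the sketch is the extra `asr()` hop in the cascade, which is exactly where errors accumulate in practice.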

How:

The goal of this project is to enhance the state of the art in end-to-end SLT, focusing on two orthogonal dimensions: architecture and data. On the architecture side, new solutions will be developed to improve translation quality by supporting SLT encoders with better input representations, fully exploiting prosodic cues in the speech, factorizing signal and content information, and meeting external stylistic constraints. On the data side, the focus will be on the exploitation of large, multilingual data featuring high speaker and content diversity. The proposed solutions will be evaluated on several translation directions using MuST-C, the largest corpus currently available for SLT research.
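Corpora like MuST-C pair each audio segment with its English transcript and a translation, so a training example can be thought of as an (audio, transcript, translation) triple. The record layout below is a hypothetical sketch of that structure; the field names are illustrative, not the actual corpus schema.

```python
from dataclasses import dataclass

@dataclass
class SLTExample:
    """Illustrative training sample for end-to-end SLT."""
    audio_features: list  # e.g. one 40-dim filterbank vector per frame
    transcript: str       # source-language (English) transcription
    translation: str      # target-language reference

# A direct model trains on (audio_features -> translation); the
# transcript is optional supervision (e.g. for multi-task learning).
ex = SLTExample(
    audio_features=[[0.0] * 40, [0.0] * 40],  # dummy frames
    transcript="Thank you so much, Chris.",
    translation="Vielen Dank, Chris.",
)
print(len(ex.audio_features), len(ex.audio_features[0]))  # 2 40
```

Modeling the triple explicitly makes it easy to derive both the ASR task (audio to transcript) and the direct SLT task (audio to translation) from the same data.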

Why:

The outcomes of this research will be a step towards meeting the demand for reliable cross-lingual subtitling software coming from the audiovisual translation market. The vast majority of the video content produced daily is in English: making it accessible in other languages is a priority for spreading knowledge across cultures. Current SLT/MT systems operate without regard to output constraints: customizing translations to subtitling and dubbing scenarios will increase their appeal for a fast-growing market.

Who:

The principal investigators are Marco Turchi and Matteo Negri, from the MT Research Unit at Fondazione Bruno Kessler, Italy.

Funded by:

This project is funded by an Amazon AWS Machine Learning Grant.

Publications:

  1. M. Gaido, B. Savoldi, L. Bentivogli, M. Negri and M. Turchi. 2020. Breeding Gender-aware Direct Speech Translation Systems. In Proceedings of COLING 2020, Online, December 8-13, 2020.
  2. A. Karakanta, S. Bhattacharya, S. Nayak, T. Baumann, M. Negri and M. Turchi. 2020. The Two Shades of Dubbing in Neural Machine Translation. In Proceedings of COLING 2020, Online, December 8-13, 2020.
  3. M. A. Di Gangi, M. Gaido, M. Negri and M. Turchi. 2020. On Target Segmentation for Direct Speech Translation. In Proceedings of AMTA 2020, Virtual, October 6-9, 2020.
  4. M. Gaido, M. A. Di Gangi, M. Negri, M. Cettolo and M. Turchi. 2020. Contextualized Translation of Automatically Segmented Speech. In Proceedings of Interspeech 2020, Shanghai, China, October 25-29, 2020.
  5. E. Ansari, A. Axelrod, N. Bach et al. 2020. Findings of the IWSLT 2020 Evaluation Campaign. In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT 2020), Seattle, USA, July 9-10, 2020.
  6. A. Karakanta, M. Negri and M. Turchi. 2020. Is 42 the Answer to Everything in Subtitling-oriented Speech Translation? In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT 2020), Seattle, USA, July 9-10, 2020.
  7. M. Gaido, M. A. Di Gangi, M. Negri and M. Turchi. 2020. End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation (IWSLT 2020), Seattle, USA, July 9-10, 2020.
  8. L. Bentivogli, B. Savoldi, M. Negri, M. A. Di Gangi, R. Cattoni and M. Turchi. 2020. Gender in Danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics (ACL 2020), Seattle, USA, July 5-10, 2020.
  9. A. Karakanta, M. Negri and M. Turchi. 2020. MuST-Cinema: a Speech-to-Subtitles Corpus. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May 13-15, 2020.
  10. M. A. Di Gangi, V. N. Nguyen, M. Negri, and M. Turchi. 2020. Instance-based Model Adaptation for Direct Speech Translation. In Proceedings of the 45th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020), Barcelona, Spain, May 4-8, 2020.
  11. M. A. Di Gangi, M. Negri, V. N. Nguyen, A. Tebbifakhr and M. Turchi. 2019. Data Augmentation for End-to-End Speech Translation: FBK@IWSLT'19. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019), Hong Kong, November 2-3, 2019.
  12. M. Di Gangi, M. Negri and M. Turchi. 2019. One-to-Many Multilingual End-to-End Speech Translation. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Sentosa, Singapore, December 14-18, 2019.
  13. A. Karakanta, M. Negri and M. Turchi. 2019. Are Subtitling Corpora really Subtitle-like? In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLIC-it 2019), Bari, Italy, November 13-15, 2019.
  14. S. M. Lakew, M. Di Gangi, and M. Federico. 2019. Controlling the Output Length of Neural Machine Translation. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019), Hong Kong, November 2-3, 2019.
  15. M. Di Gangi, R. Enyedi, A. Brusadin and M. Federico. 2019. Robust Neural Machine Translation for Clean and Noisy Speech Transcripts. In Proceedings of the 16th International Workshop on Spoken Language Translation (IWSLT 2019), Hong Kong, November 2-3, 2019.

Software:

  • FBK-fairseq-ST: an adaptation of FAIR’s fairseq for direct speech translation.