Machine Translation - Resources

Software

  • Moses – A statistical machine translation system
  • IRSTLM – A toolkit featuring algorithms and data structures to store and access very large n-gram language models
  • Online MGIZA++ – An extension of MGIZA++ that aligns sentence pairs in online mode
  • AQET – Adaptive Quality Estimation tool for Machine Translation
  • ModernMT – A neural adaptive machine translation system that adapts to context and learns from corrections

Corpora

  • BinQE – A Machine Translation Dataset Annotated with Binary Quality Judgements
  • BitterCorpus – English-Italian corpus with annotated bilingual terms in the IT domain
  • CLTE Benchmark – Cross-Lingual Textual Entailment Dataset
  • eSCAPE – A Large-scale Synthetic Corpus for Automatic Post-Editing
  • Heroes-ON-OFF – An annotation of dubbing segments based on the Heroes corpus
  • MAGMATic – Italian-English multi-domain academic gold standard with manual annotation of terminology
  • MuST-C – A Multilingual Speech Translation Corpus
  • MuST-C Common Test Set – Additional reference translations (post-edits) for English-German/Italian/Spanish
  • MuST-Cinema – A Speech-to-Subtitles corpus
  • MuST-SHE – A multilingual benchmark for the evaluation of gender bias in Machine Translation and Speech Translation
  • MuST-Speakers – Annotation of MuST-C talks with speakers’ gender information
  • MuST-C Gender-balanced Validation Set – New MuST-C validation set balanced with respect to speakers’ gender
  • RTE3-derived CLTE dataset – A cross-lingual entailment corpus, obtained by translating the RTE-3 dataset
  • TOSCA-MP Speech Ground Truth – A multilingual dataset of news and talk show transcriptions and translations
  • WAGS – English-Italian Word Alignment Gold Standard
  • WIT3 – A ready-to-use version of the multilingual transcriptions of TED talks for MT research purposes

Models

  • ST fairseq – Models for Speech Translation based on the fairseq Python package