- CRACKER (Cracking the Language Barrier: Coordination, Evaluation and Resources for European MT Research)
CRACKER is a Machine Translation (MT) research initiative involving 7 major universities and research institutions (DFKI, FBK, Charles University, University of Edinburgh, University of Sheffield, Athena Research and Innovation Center in Information, Communication and Knowledge Technologies, and ELDA). Its goal is to cope with the striking disproportion between the scope of current MT challenges and the available resources, especially for translation to and from languages that have only fragmentary or no technological support at all. The nucleus of CRACKER’s research, development, and innovation strategy towards high-quality MT is the group of projects funded through H2020-ICT-17a/b (partly extending to relevant FP7 actions such as QTLeap, LIDER and MLi). This nucleus will be supported by CRACKER (ICT-17c) in coordination, evaluation and resources. In order to achieve its challenging goals, CRACKER will build upon, consolidate and extend initiatives for collaborative MT research supported by earlier EU-funded actions. These include evaluation campaigns such as the Workshop on Statistical Machine Translation (WMT) and the International Workshop on Spoken Language Translation (IWSLT), the META-SHARE open infrastructure for sharing language resources and technologies with extensions for MT assembled by QTLaunchPad, and open-source tool building and training (MT Marathons). Coordination, communication and outreach to user communities will build upon existing networks and communication infrastructures such as the META-FORUM event series and strong involvement of industrial associations.
Date: Thursday, 1 January, 2015 to Sunday, 31 December, 2017
Funding: Horizon 2020 – CSA
- MMT (Modern Machine Translation)
The goal of MMT is to deliver a language-independent commercial online translation service based on a new open-source distributed machine translation architecture. MMT does not require any initial training phase: once fed with training data, it will be ready to translate. MMT will de facto merge translation memory and machine translation technology into a single product. The quality of translations will increase as soon as new training data are added. MMT manages context automatically, so it will not require building domain-specific systems. MMT will provide the best translation quality for any topic or domain by storing training segments together with context linking information. MMT enables scalability of data and users, so that expensive ad-hoc hardware installations are no longer needed. The MMT architecture will support high performance and linear scalability up to thousands of nodes. The same software will work to set up a personal translation system or to create a web-based service on a cluster of commodity nodes able to handle terabytes of data and millions of users.
MMT will create a data collection infrastructure that accelerates the process of filling the data gap between large IT companies and the MT industry. MMT will leverage the data crawled on the web by Common Crawl, TAUS, Translated’s MyMemory and Matecat data and facilities to set up a processing pipeline that will create unprecedented amounts of clean parallel and monolingual data to develop machine translation systems.
Date: Thursday, 1 January, 2015 to Sunday, 31 December, 2017
Funding: Horizon 2020 – IA
- Translated Sponsorship
We are glad to acknowledge a gift by Translated Srl to Marcello Federico in order to support PhD students working on open source projects during 2016.
Date: Thursday, 16 June, 2016
Funding: Financial gift
- eBay Sponsorship
Our unit is glad to acknowledge a financial gift from eBay Inc. to Marcello Federico that will support research in machine translation by his PhD students during 2015. The gift has been used to co-fund the yearly bursaries of the students Prashant Mathur and Jose Camargo de Souza, as well as their travel expenses for an extended visit to eBay’s labs in San Jose, California.
Date: Thursday, 1 January, 2015 to Tuesday, 30 June, 2015
Funding: Financial gift
- META-NET (A Network of Excellence forging the Multilingual Europe Technology Alliance)
META-NET is a Network of Excellence dedicated to fostering the technological foundations of a multilingual European information society. Language Technologies will enable communication and cooperation across languages, secure users of any language equal access to information and knowledge, build upon and advance functionalities of networked information technology. A concerted, substantial, continent-wide effort in language technology research and engineering is needed for realising applications that enable automatic translation, multilingual information and knowledge management and content production across all European languages. This effort will also enhance the development of intuitive language-based interfaces to technology ranging from household electronics, machinery and vehicles to computers and robots.
- EU-Bridge (Bridges Across the Language Divide)
EU-BRIDGE aims at developing automatic transcription and translation technology that will permit the development of innovative multimedia captioning and translation services for audiovisual documents between European and non-European languages. The project will provide streaming technology that can convert speech from lectures, meetings, and telephone conversations into text in another language. To this end, EU-BRIDGE intends to bring together academic, engineering and business expertise in order to create competitive offers for existing needs in translation, communication, content processing and publishing. The four use cases are: Captioning Translation for TV broadcasts, University Lecture Translations, European Parliament Translations, and Unified Communication Translation. The prospective users of the project are European companies operating in the audiovisual market (in particular TV captioning and translation).
- MosesCore (Promoting Open-Source Machine Translation)
MosesCore aims to encourage the development and usage of open source machine translation tools. It will achieve this by organising:
(1) Academic workshops and evaluation campaigns in order to publish and compare the latest research in machine translation
(2) Machine translation marathons, to implement the latest machine translation techniques, and to discuss and present recent implementations
(3) Industrial outreach events to provide tutorials and share knowledge on the use of open source machine translation.
MosesCore will also coordinate and support the development of open source software for machine translation, notably the Moses statistical MT toolkit.
This will result in at least three major releases of Moses, one in each year of the project.
MosesCore is an EU-funded Coordination Action that brings together academic and commercial partners sharing a common interest in open source machine translation.
Date: Wednesday, 1 February, 2012 to Saturday, 31 January, 2015
Duration: 36 months
Partners: University of Edinburgh, TAUS, Fondazione Bruno Kessler, Charles University, Capita Translation and Interpreting
- EXCITEMENT (EXploring Customer Interactions through Textual EntailMENT)
There are two interleaved high-level goals for this project. The first is to set up, for the first time, a generic architecture and a comprehensive implementation for a multilingual textual inference platform and to make it available to the scientific and technological communities. The second goal of the project is to develop a new generation of inference-based industrial text exploration applications for customer interactions, which will enable businesses to better analyze and make sense of their diverse and often unpredicted client content. These goals will be achieved for three languages, i.e. English, German and Italian, and for three customer interaction channels, i.e. speech (transcriptions), email and social media.
- MateCat (Machine Translation Enhanced Computer Assisted Translation)
MateCat pushes what is considered the new frontier of Computer Assisted Translation (CAT) technology, that is, how to effectively and ergonomically integrate Machine Translation (MT) within the human translation workflow. While today MT is mainly trained with the objective of creating the most comprehensible output, in MateCat we target MT technology that will minimize the translator’s post-editing effort. To this end, MateCat is developing an enhanced web-based CAT tool that will offer new MT capabilities, such as automatic adaptation to the translated content, online learning from user corrections, and automatic quality estimation.
- EuroSentiment (Language Resource Pool for Sentiment Analysis in European Languages)
The main concept of the EuroSentiment project is to provide a shared language resource pool for fostering sentiment analysis. To this end, the requirements of language resource providers and users will be collected in detail, with the aim of satisfying their needs in terms of multilinguality, quality and domain coverage. The next step will be to provide interoperability between resources, cleaning and aligning them in order to offer a homogeneous interface.
- TOSCA-MP (Task-oriented search and content annotation for media production)
The TOSCA-MP project aims to develop user-centric content annotation and search tools for professionals in networked media production and archiving (television, radio, online), addressing their specific use cases and workflow requirements. The project brings together 10 partners from 6 European countries including industry partners providing solutions for the media industry, public service broadcasters as well as their European association, a university and research centres. TOSCA-MP investigates scalable and distributed content processing methods performing advanced multimodal information extraction and semantic enrichment. Other key technology areas include search methods across heterogeneous networked content repositories and novel user interfaces. An open standards based service oriented framework integrates the components of the system.
- CoSyne (Multi-Lingual Content Synchronization with Wikis)
The combination of dynamic user-generated content and multilingual aspects is particularly prominent in Wiki sites. Wikis have gained increased popularity over the last few years as a means of collaborative content creation as they allow users to set up and edit web pages directly. A growing number of organizations use Wikis as an efficient means to provide and maintain information across several sites. Currently, multilingual Wikis rely on users to manually translate different Wiki pages on the same subject. This is not only a time-consuming procedure but also the source of many inconsistencies, as users update the different language versions separately, and every update would require translators to compare the different language versions and synchronize the updates. The overall aim of the CoSyne project is to automate the dynamic multilingual synchronization process of Wikis.
- PESCaDO (Personalized Environmental Service Configuration and Delivery Orchestration)
In present-day Europe, with its well-established national air quality and meteorological networks, there are solid ties between the air quality and meteorological agencies and seemingly well-connected data distribution networks. However, there is an increasing need for the orchestration of environmental services spread across the Web in order to provide users with personalised decision support or tailored environmental information. PESCaDO aims to meet this need for environmental service orchestration. It will offer an interconnected multipurpose environmental user-oriented service for a federated community of citizens, public services (such as tourist offices and environmental institutions), public administrations, and entrepreneurs active in sectors sensitive to environmental conditions.
- FLaReNet (Fostering Language Resources Network)
A major condition for the take-off of the field of Language Resources and Language Technologies is the creation of a shared policy for the next years. FLaReNet aims at developing a common vision of the area and fostering a European strategy for consolidating the sector, thus enhancing competitiveness at EU level and worldwide. By creating a consensus among major players in the field, the mission of FLaReNet is to identify priorities as well as short, medium, and long-term strategic objectives and provide consensual recommendations in the form of a plan of action for EC, national organisations and industry. Through the exploitation of new collaborative modalities as well as workshops and meetings, FLaReNet will sustain international cooperation and (re)create a wide Language community.
- EuroMatrixPlus (Bringing Machine Translation for European Languages to the User)
During its three-year term, EuroMatrixPlus had a significant impact on the development of the field of machine translation (MT) in Europe and worldwide. A successor to the well-known and successful EuroMatrix project, in which MT systems for a wide range of European language pairs were built, EuroMatrixPlus ran from March 2009 until April 2012 (38 months) with a budget of 5.94 MEUR, funded by the European Commission’s Seventh Framework Programme under contract 231720. The project was coordinated by DFKI, the German Research Center for Artificial Intelligence, with its Language Technology Lab in Saarbrücken. Hans Uszkoreit was the principal investigator and scientific coordinator.
The project’s focus was on “bringing MT to the user”. The open-source system Moses, the SMT toolkit most widely adopted not only by academia but also by the translation industry, was further developed by the project. In addition to cutting-edge research in SMT and hybrid approaches to MT, in which rule-based and statistical components are combined in various ways to benefit from the strengths of both approaches, the project has organized several “MT Marathons” and continued the annual evaluation campaigns with shared tasks on burning issues, all widely recognized by the field. The organization of specialized workshops with industrial users, the release of resources and software and a total of 196 scientific publications complement the success story of EuroMatrixPlus.
- Speech Analytics
The project includes technology transfer activities from FBK-irst to Pervoice and development activities such as improvements of automatic transcription technology (rich transcription, automatic text polishing), speech analytics technologies for call centers (emotional state recognition, segmentation and classification of utterances, monitoring of transactions), and advanced acoustic normalization techniques.
- LiveMemories (Active Digital Memories of Collective Life)
From a scientific/technical perspective, LiveMemories aims at scaling up content extraction techniques towards very large scale extraction from multimedia sources, setting the scene for a Content Management Platform for Trentino; at using this information to support new ways of linking, summarizing and classifying data in a new generation of digital memories which are ‘alive’ and user-centered; and at turning the creation of such memories into a communal web activity. Achieving these objectives will make Trento a key player in the new Web Science Initiative, digital memories, and Web 2.0. But LiveMemories is also intended to have a social and cultural impact besides the scientific one: through the collection, analysis and preservation of digital memories of Trentino; by facilitating and encouraging the preservation of such community memories; and by fostering new forms of community and enriching our cultural and social heritage.
- JUMAS (Judicial Management by Digital Libraries Semantics)
JUMAS addresses the need to build an infrastructure able to optimise the information workflow in order to facilitate later analysis. New models and techniques for representing and automatically extracting the embedded semantics derived from multiple data sources will be developed. The most important goal of the JUMAS system is to collect, enrich and share multimedia documents annotated with embedded semantics, minimising manual transcription activity. JUMAS is tailored to managing situations in which multiple cameras and audio sources are used to record assemblies in which people’s debates and event sequences need to be semantically reconstructed for future consultation. The JUMAS prototype will be tested interworking with legacy systems, but the system can be viewed as able to support business processes and problem-solving in a variety of domains.
AINEVA is the association of Italian regions and autonomous provinces that include part of the Alps; its goal is to coordinate its members’ efforts in prevention and information in the field of snow and avalanches. On a daily basis, AINEVA members compile and publish bulletins of conditions and avalanche forecasts written in Italian. To make them accessible to non-Italian speakers (e.g. foreign tourists), AINEVA members provide translations of the original bulletins into languages such as English, French, German and Slovenian. AINEVA and FBK started a 20-month collaboration at the end of 2009 with the goal of developing a system for the automatic translation of such bulletins into English, French and German.
- ATLAS (Automatic Translation into Sign Language)
The ATLAS project builds a technological bridge between the cognitive sciences and the most advanced information technologies. The project is sponsored by the Piedmont Region and financed by the programme for the development of innovative services. It gives deaf people the possibility to watch and understand mass media programmes through automatic translation from written Italian into Italian Sign Language (LIS), which is visualized by a virtual actor created with computer animation tools. With these tools, the project aims to enable deaf people to understand television programmes, web pages and films on DVD by means of the virtual translator, which can be displayed on different types of screen, from televisions to computers and from mobile phones to PDAs.
- Meteo Trentino
Meteo Trentino and FBK started a collaboration in 2009 with the goal of developing a translation system for weather forecast bulletins. On a daily basis, Meteo Trentino compiles and publishes on its website a weather bulletin for the Trentino area, written in Italian. An equivalent or shortened version of that bulletin is also published in English and German after manual translation by the forecasters. FBK was asked to provide forecasters with a system for the automatic translation of Italian bulletins into those two languages. In the course of 2009, FBK developed the translation system and a simple interface (shown in the figure below), which were delivered to Meteo Trentino and are currently used by its forecasters. The project has been extended to 2010 with the goal of improving the quality of the automatic translation by exploiting feedback from real users.
- TALES (Trattamento Automatico delle Lingue Ladina e Sarda)
A project on the creation of resources and infrastructure for two minority languages of Italy, namely Ladin and Sardinian.
- X-Media (Knowledge Management across Media)
X-Media addresses the issue of knowledge management in complex distributed environments. It studies, develops and implements large scale methodologies and techniques for knowledge management able to support sharing and reuse of knowledge that is distributed in different media (images, documents and data) and repositories (data bases, knowledge bases, document repositories, etc.).
- QALL-ME (Question Answering Learning technologies in a multiLingual and Multimodal Environment)
The scientific and technological objectives of the project pursue three crucial directions: multilingual open domain QA, user-driven and context-aware QA, and learning technologies for QA. The specific research objectives of the project include state-of-the-art advancements in the complexity of the questions handled by the system (e.g. how questions); the development of a web-based architecture for cross-language QA (i.e. question in one language, answer in a different language); the realization of real-time QA systems for concrete applications; the integration of the temporal and spatial context both for question interpretation and for answer extraction; the development of a robust framework for applying minimally supervised machine learning algorithms to QA tasks; and the integration of mature technologies for automatic speech recognition within the open domain question answering framework.
- ITCH (Intelligent Technologies for Cultural Visits and Mobile Education)
The project is concerned with new concepts for intelligent technologies supporting museum visits. Various areas of artificial intelligence are involved, including the production of dynamic presentations, group interaction support, user modelling, natural language processing, reasoning, tabletop collaborative systems, human-computer interaction.
- PATExpert (Advanced Patent Document Processing Techniques)
Tasks that ultimately require knowledge-based multimedia techniques (content-oriented search, assessment, abstracting, etc.) are still to a major extent carried out manually. PATExpert’s overall scientific goal is to change the paradigm currently followed for patent processing from textual (viewing patents as text blocks enriched by “canned” picture material, sequences of morpho-syntactic tokens, or collections of syntactic structures) to semantic (viewing patents as multimedia knowledge objects) processing. PATExpert will develop a multimedia content representation formalism based on Semantic Web technologies for selected technology areas and investigate the retrieval, classification, multilingual generation of concise patent information, assessment and visualization of patent material encoded in this formalism, taking into account the information needs of all user types as defined in a user typology. PATExpert’s technological goal is to develop a showcase that demonstrates the viability of PATExpert’s approach to content representation for real applications. The composition and the competence of the Consortium ensure the achievement of these goals.
- ONTOTEXT (From Text to Knowledge for the Semantic Web)
Based on the philosophy of the Semantic Web, Ontotext exploits text processing and automatic reasoning technologies to extract knowledge from texts and organise it conceptually in an ontology. Unlike common search engines, the Ontotext Portal directly accesses the concepts and entities of the ontology and presents the user with structured information instead of mere portions of texts. For each entity, the Ontotext Portal offers four different views: Articles (lists all the documents in which it is mentioned), Citografo (shows how often it is mentioned), Opinions (shows how often opinions are expressed about it), and Record (provides extra information about it).
- TC-STAR (Technology and Corpora for Speech to Speech Translation)
The TC-STAR project is envisaged as a long-term effort to advance research in all core technologies for Speech-to-Speech Translation (SST). SST technology is a combination of Automatic Speech Recognition (ASR), Spoken Language Translation (SLT) and Text to Speech (TTS) (speech synthesis). The objectives of the project are ambitious: making a breakthrough in SST that significantly reduces the gap between human and machine translation performance. The project targets a selection of unconstrained conversational speech domains (speeches and broadcast news) and three languages: European English, European Spanish, and Mandarin Chinese. Accurate translation of unrestricted speech is well beyond the capability of today’s state-of-the-art research systems. Therefore, advances are needed to improve the state-of-the-art technologies for speech recognition and speech translation.
- MEANING (Developing Multilingual Web-scale Language Technologies)
MEANING will be concerned with automatically collecting and analysing language data from the WWW on a large scale, and building more comprehensive multilingual lexical knowledge bases to support improved word sense disambiguation (WSD). Current web access applications are based on words; MEANING will open the way for access to the Multilingual Web based on concepts, providing applications with capabilities that significantly exceed those currently available. MEANING will facilitate development of concept-based open domain Internet applications (such as Question/Answering, Cross Lingual Information Retrieval, Summarisation, Text Categorisation, Event Tracking, Information Extraction, Machine Translation, etc.). Furthermore, MEANING will supply a common conceptual structure to Internet documents, thus facilitating knowledge management of web content.
- Dot.Kom (Designing adaptive informatiOn exTraction from text for KnOwledge Management)
The consortium will study, design and implement innovative methodologies for KM based on the use of IE. From the scientific point of view we will focus on two aspects that are symmetric: how the use in KM poses requirements and challenges to IE and how the use of IE changes KM. From the practical point of view we will define tools and methodologies for IE-based KM.
- FAME (Facilitating Agent for Multicultural Exchange)
Advances in IT are making possible new tools for human-human communication. Integration of speech, vision and dialog offers the possibility of a new class of tools to aid communication between people from different cultures using different languages.
The project will address the problem of integrating multiple communications modalities blending the physical and virtual worlds to provide support for multicultural communication and problem solving. The major challenges will be automatic perception of human action and understanding of human free dialog between people from different cultures. The consortium will construct an information butler, which demonstrates context awareness in a problem-solving scenario using computer vision, speech and dialog modelling.
- Web-FAQ (Web Flexible Access and Quality)
In the context of three problem areas identified as critical for the future development of the Internet, WebFAQ aims at addressing the problem of the analysis and representation of information content. More specifically, the project concentrates on access to information contained in very large, unstructured, heterogeneous repositories; on multimodal presentation of information; and on the assessment of the quality of information.
- PF-Star (Preparing Future Speech Translation Research)
The PF-STAR project intends to help establish future activities in the field of multisensorial and multilingual communication (interface technologies) on firmer bases by providing technological baselines, comparative evaluations, and an assessment of the prospects of core technologies, which future research and development efforts can build on. To this end, the project will address three crucial areas: technologies for speech-to-speech translation, the detection and expression of emotional states, and core speech technologies for children. For each of them, promising technologies and approaches will be selected, further developed and aligned towards common baselines. The results will be assessed and evaluated with respect to both their performance and future prospects. To maximise the impact, the duration of the project is limited to 24 months, and the workplan has been designed to deliver results in two stages: at mid-project term (month 14) and at the end of the project. This will make relevant results available as soon as possible, and in particular in time for them to be used during the preparatory phase of the first call of FP6. The Lehrstuhl für Informatik 6 is involved in the comparative evaluation and further development of speech translation technologies. The statistical approach is to be compared to an interlingua-based approach. After the evaluation phase, the two approaches are to be further developed and aligned towards common baselines. PF-STAR is supported by the European Union.
- EDAMOK (Enabling Distributed and Autonomous Management of Knowledge)
The EDAMOK project (Enabling Distributed and Autonomous Management of Knowledge) aims at promoting a distributed approach to knowledge management, namely an approach based on the following two principles: (i) Principle of Autonomy: each organizational unit should be allowed a large degree of autonomy in managing (creating, representing, organizing, selecting, sharing) its own knowledge (“local” knowledge); (ii) Principle of Coordination: knowledge sharing across organizational units should be thought of as a form of coordination between multiple autonomous perspectives rather than as a process of creating (and imposing) a supposedly shared knowledge structure. The goal of EDAMOK is to develop (i) a theoretical framework, (ii) a methodology, and (iii) a collection of technological tools to support this distributed and autonomous approach to knowledge management.
- ECHO (European Chronicles on-Line)
The main objectives of the project are to develop a long-term reusable software infrastructure to support digital film archives, to provide Web-based access to collections of historical documentary films of great international value and to increase the productivity and cost effectiveness of producing digital film archives. The project will develop and demonstrate an open architecture approach to distributed digital film archive services. The open architecture will support service extensibility and interoperability. The distinct features of the ECHO system will be semi-automatic metadata extraction and acquisition from digital film information, non-English speech recognizers (Italian, French, Dutch) for the purpose of indexing, searching and retrieval, cross-language retrieval capabilities, intelligent access to digital films, automatic film summary creation, collection mechanisms, privacy and billing mechanisms.
- CoreTex (Improving Core Speech Recognition Technology)
Speech, being the most natural means of human interaction, can also be viewed as one of the most important modalities in complex human-machine interaction. State-of-the-art speech recognition technology still lacks robustness with respect to environmental conditions and speaking style. In addition, most research effort has been devoted to American English. The Coretex project aims at improving current technology by making it less sensitive to environmental and linguistic factors, and more suitable for European languages. This will be achieved through fast system development and generic core technology, dynamic adaptation techniques, and methods to enrich transcriptions. Evaluation and demonstration frameworks will be established to assess and illustrate improvement. An associated industrial user panel will have early access to project results and help to select the most pressing research issues and problems common to important application domains.
- NESPOLE (Speech-to-Speech translation in e-commerce/service scenarios)
The NESPOLE! system was developed using two scenarios: the tourism scenario and the first aid medical assistance scenario. During the life of the project, three main data collections were carried out in order to develop the first and the second showcase. During the first year, 191 dialogues were collected: 62 German, 61 Italian, 37 English and 31 French. In particular, 6 hours of dialogues for Italian and French, 7 hours for English and 8 hours for German were recorded. The dialogues were about five predefined tourism scenarios. During the last year, two major data collections were carried out: the first aimed at expanding the tourism scenario and the second at addressing the medical domain. For the monolingual data collection, five tourism scenarios were developed; 66 dialogues were recorded, yielding 994.57 minutes of data: 243.52 minutes in sixteen English dialogues, 246 minutes in sixteen German dialogues, 272.52 minutes in seventeen French dialogues and 232.53 minutes in seventeen Italian dialogues. The data collection in the medical domain involved Italian, English and German. A total of 49 dialogues were collected, resulting in 8 hours 25 minutes of audio.
- TAL (Trattamento Automatico della Lingua Italiana)
The HLT group was involved in three tasks: ItalwordNet, NLP architectures, and Treebank for Italian.
- FACILE (Fast and Accurate Categorization of Information by Language Engineering)
The FACILE project aimed at the development of a system for the exact and specific categorisation of texts from the area of finance and business news. The intended users of the system were institutions from finance and commerce that have a vital interest in up-to-date business information stemming from online news agencies and periodicals. An important consideration in FACILE has been its use across country and language borders. The possibility to process texts in various languages and to derive factual information in a language-independent, formatted form should allow for the rapid dissemination of information across borders.