E3C - European Clinical Case Corpus
E3C aims to collect and annotate a multilingual corpus of clinical narratives, with the ambition of becoming a reference European resource. A clinical narrative is an account of clinical practice that presents the reason for a clinical visit, the description of physical exams, and the assessment of the patient’s situation. We focus on published clinical narratives because they are often de-identified, which overcomes privacy issues, and because they are rich in clinical entities and temporal information, both of which are almost absent in other clinical documents (e.g. radiological reports). E3C will deal with three types of clinical narratives: discharge summaries, clinical cases published in journals, and clinical cases from medical training resources.
E3C will build a 5-language (Italian, English, Spanish, French and Basque) clinical narrative corpus to allow linguistic analysis, benchmarking, and training of information extraction systems. The project will build upon available resources (distributed under open access licenses) and collect new data when necessary. The goal is to harmonise current annotations, introduce new annotation layers, and provide baselines for information extraction tasks.
We foresee three types of annotations:
– clinical entities: pathologies, symptoms, procedures, and body parts, according to standard clinical taxonomies (e.g. ICD-10 and SNOMED-CT);
– temporal information: events, time expressions and temporal relations, according to the THYME TimeML standard;
– factuality: event factuality values and assessment of the effect of negation, uncertainty and hedge expressions on those values.
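As a rough illustration of how the three annotation types above could sit side by side on one sentence, here is a minimal Python sketch. It is not the E3C annotation format: the class names, the `span_of` and `covered_text` helpers, the example sentence, and the label values ("symptom", "negated", "factual", "BEFORE") are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Span:
    """Character offsets into the document text (end is exclusive)."""
    start: int
    end: int


@dataclass(frozen=True)
class ClinicalEntity:
    span: Span
    category: str  # hypothetical labels: pathology, symptom, procedure, body_part


@dataclass(frozen=True)
class Event:
    span: Span
    factuality: str  # hypothetical labels, e.g. "factual", "negated", "uncertain"


@dataclass(frozen=True)
class TimeExpression:
    span: Span


@dataclass(frozen=True)
class TemporalRelation:
    source: Event
    target: Union[Event, TimeExpression]
    label: str  # hypothetical relation label


def span_of(text: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in text (illustrative helper)."""
    start = text.index(phrase)
    return Span(start, start + len(phrase))


def covered_text(text: str, span: Span) -> str:
    """Return the text covered by a span."""
    return text[span.start:span.end]


# Invented example sentence, not taken from the corpus.
text = "The patient denied fever two days before admission."

fever = ClinicalEntity(span_of(text, "fever"), category="symptom")
fever_event = Event(span_of(text, "fever"), factuality="negated")
admission = Event(span_of(text, "admission"), factuality="factual")
duration = TimeExpression(span_of(text, "two days"))
relation = TemporalRelation(fever_event, admission, label="BEFORE")

print(covered_text(text, fever.span), fever_event.factuality, relation.label)
```

In this sketch the same span carries both a clinical entity and an event, which is one possible way to let the negation cue ("denied") affect the factuality value without changing the entity category.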
E3C is organised into three layers, with different purposes:
– The first layer consists of full manual annotations of clinical entities, temporal information and factuality, for benchmarking and linguistic analysis;
– The second layer consists of semi-automatic annotations of clinical entities, to be used to train baseline systems;
– The third layer consists of non-annotated medical documents (not necessarily clinical narratives) to be exploited by semi-supervised approaches.
The E3C project is organised into six main activities: guideline definition, data collection, data annotation, quality assessment, baselines, and integration.
Project Funding: ELG (European Language Grid) Pilot Projects Open Call 1 (Grant Agreement No. 825627 – H2020, ICT 2018-2020 FSTP)