Proceedings chapter
Open access
English

FRASIMED : A Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

Published in The Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) : main conference proceedings, Editors Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S. & Xue, N., p. 7450-7460
Presented at Torino, 20-25 May 2024
Publisher Torino : EREC; COLING
Publication date 2024-05
Abstract

Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs), where there is still a need for larger annotated datasets. This research article introduces a methodology, freely available on GitHub (https://github.com/JamilProg/crosslingual_bert_annotation_projection), for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, the methodology offers an efficient way to enlarge low-resource corpora with little human effort, using only already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to assessing the quality and effectiveness of semi-automatic data generation strategies. The evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of the French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French NLP applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.

Keywords
  • Annotation projection
  • Crosslingual
  • Medical entity recognition
  • Entity linking
  • Large medical annotated dataset
Citation (ISO format)
ZAGHIR, Jamil et al. FRASIMED : A Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection. In: The Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) : main conference proceedings. Torino. Torino : EREC; COLING, 2024. p. 7450–7460.
Main files (1)
Proceedings chapter (Published version)
Identifiers
  • PID : unige:177906