Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains

Segonne, Vincent; Mannion, Aidant; Alonzo Canul, Laura Cristina; Audibert, Alexandre; Liu, Xingyu; Macaire, Cécile; Pupier, Adrien; Zhou, Yongxin; Aguiar, Mathilde; Herron, Felix; Norré, Magali; Amini, Massih-Reza; Bouillon, Pierrette; Eshkol-Taravella, Iris; Esperança-Rodier, Emmanuelle; François, Thomas; Goeuriot, Lorraine; Goulian, Jérôme; Lafourcade, Mathieu; Lecouteux, Benjamin; Portet, François; Ringeval, Fabien; Vandeghinste, Vincent; Coavoux, Maximin; Dinarelli, Marco; Schwab, Didier

Proceedings chapter

English

Presented at: Torino, Italy, 20-25 May 2024

Published in: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue (Ed.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), p. 9463-9476

Publisher: ELRA and ICCL

Publication date: 2024

Abstract

Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise the models' utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, as well as single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pretraining with the approximate LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as the contributed datasets.
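
The abstract names the LinFormer attention mechanism but does not detail it. As a rough illustration only, the sketch below assumes the commonly described Linformer formulation, in which keys and values are projected along the sequence axis from n rows down to r rows before attention is computed. It is not the Jargon models' implementation: the function names, the sizes, and the random (rather than learned) projection matrices are all assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def linformer_attention(q, k, v, proj_k, proj_v):
    """Single-head, Linformer-style attention (hypothetical helper).

    q, k, v: n x d matrices (n tokens, d features per token).
    proj_k, proj_v: r x n matrices that compress the sequence axis to r rows.
    In a trained model these projections would be learned parameters.
    """
    d = len(q[0])
    k_small = matmul(proj_k, k)              # r x d
    v_small = matmul(proj_v, v)              # r x d
    scores = matmul(q, transpose(k_small))   # n x r, rather than n x n
    scale = 1.0 / math.sqrt(d)
    weights = softmax_rows([[s * scale for s in row] for row in scores])
    return matmul(weights, v_small)          # n x d


def random_matrix(rows, cols, rng):
    """Gaussian random matrix, used here only as stand-in data."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, r = 16, 4, 3  # illustrative sizes, not taken from the paper
q, k, v = (random_matrix(n, d, rng) for _ in range(3))
proj_k = random_matrix(r, n, rng)
proj_v = random_matrix(r, n, rng)
out = linformer_attention(q, k, v, proj_k, proj_v)
```

Under these assumptions the score matrix has n x r entries instead of n x n, so with r fixed the cost grows linearly with document length. This is the property that makes such an approximation attractive for long documents.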

Keywords

Self-supervised learning
Pretrained language models
Evaluation benchmark
Biomedical document processing
Legal document processing
Speech transcription

Affiliation entities

Faculté de traduction et d'interprétation / Département de traitement informatique multilingue

Funding

Swiss National Science Foundation - PRojection du langage Oral vers des unités PICTOgraphiques - PROPICTO [197864]

Citation (ISO format)

SEGONNE, Vincent et al. Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue (Ed.). Torino, Italy. [s.l.] : ELRA and ICCL, 2024. p. 9463–9476.

Proceedings chapter (Published version)

CC BY-NC-4.0

Identifiers

PID: unige:177151

Additional URL for this publication: https://aclanthology.org/2024.lrec-main.827

Creation: 18/05/2024 09:02:56

First validation: 21/05/2024 08:22:30

Update time: 21/05/2024 08:22:30

Status update: 21/05/2024 08:22:30

Last indexation: 17/12/2024 15:38:43

Archive ouverte UNIGE
