Doctoral thesis
English

TransBERT: Leveraging Automatic Translation for Domain-Specific Knowledge Transfer

Imprimatur date: 2025-04-10
Defense date: 2025-04-10
Abstract

This thesis addresses the language barrier in life science NLP by developing TransBERT, a French biomedical language model trained entirely on over 22 million machine-translated abstracts. Leveraging advances in machine translation, the study demonstrates that TransBERT achieves competitive or superior results compared to state-of-the-art French models, even without access to extensive native corpora. Additionally, the research shows that domain-specific tokenization significantly boosts performance, particularly for named entity recognition tasks. The work introduces TransCorpus, the largest French life science corpus to date, and provides a robust evaluation framework for model comparison. These findings highlight scalable strategies for building high-quality language models in low-resource language/domain pairs, paving the way for broader access to advanced NLP tools in virtually any domain or language.

Citation (ISO format)
KNAFOU, Julien David Marc. TransBERT: Leveraging Automatic Translation for Domain-Specific Knowledge Transfer. Thèse, 2025. doi: 10.13097/archive-ouverte/unige:185244
Main files (1)
Thesis
Access level: Public
Secondary files (1)
Imprimatur
Access level: Public

Technical information

Creation: 30/05/2025 07:35:29
First validation: 30/05/2025 14:29:46
Update time: 06/02/2026 16:52:54
Status update: 06/02/2026 16:52:54
Last indexation: 06/02/2026 16:52:55
All rights reserved by Archive ouverte UNIGE and the University of Geneva