Doctoral thesis
English

TransBERT: Leveraging Automatic Translation for Domain-Specific Knowledge Transfer

Imprimatur date: 2025-04-10
Defense date: 2025-04-10
Abstract

This thesis addresses the language barrier in life science NLP by developing TransBERT, a French biomedical language model trained entirely on over 22 million machine-translated abstracts. Leveraging advances in machine translation, the study demonstrates that TransBERT achieves competitive or superior results compared to state-of-the-art French models, even without access to extensive native corpora. Additionally, the research shows that domain-specific tokenization significantly boosts performance, particularly for named entity recognition tasks. The work introduces TransCorpus, the largest French life science corpus to date, and provides a robust evaluation framework for model comparison. These findings highlight scalable strategies for building high-quality language models in low-resource language/domain pairs, paving the way for broader access to advanced NLP tools in virtually any domain or language.

Citation (ISO format)
KNAFOU, Julien David Marc. TransBERT: Leveraging Automatic Translation for Domain-Specific Knowledge Transfer. Thèse, 2025. doi: 10.13097/archive-ouverte/unige:185244
Main files (1)
Thesis
Access level: Public
Secondary files (1)
Imprimatur
Access level: Public
Identifiers
355 views
1105 downloads

Technical information

Creation: 30/05/2025 07:35:29
First validation: 30/05/2025 14:29:46
Update: 05/06/2026 12:50:51
Status update: 05/06/2026 12:50:51
Last indexation: 05/06/2026 12:50:52
All rights reserved by Archive ouverte UNIGE and the University of Geneva