Doctoral thesis
English

TransBERT: Leveraging Automatic Translation for Domain-Specific Knowledge Transfer

Imprimatur date: 2025-04-10
Defense date: 2025-04-10
Abstract

This thesis addresses the language barrier in life science NLP by developing TransBERT, a French biomedical language model trained entirely on over 22 million machine-translated abstracts. Leveraging advances in machine translation, the study demonstrates that TransBERT achieves competitive or superior results compared to state-of-the-art French models, even without access to extensive native corpora. Additionally, the research shows that domain-specific tokenization significantly boosts performance, particularly for named entity recognition tasks. The work introduces TransCorpus, the largest French life science corpus to date, and provides a robust evaluation framework for model comparison. These findings highlight scalable strategies for building high-quality language models in low-resource language/domain pairs, paving the way for broader access to advanced NLP tools in virtually any domain or language.

Citation (ISO format)
KNAFOU, Julien David Marc. TransBERT: Leveraging Automatic Translation for Domain-Specific Knowledge Transfer. Thèse, 2025. doi: 10.13097/archive-ouverte/unige:185244
Main files (1)
Thesis
Access level: Public
Secondary files (1)
Imprimatur
Access level: Public
321 views
868 downloads

Technical information

Creation: 05/30/2025 7:35:29 AM
First validation: 05/30/2025 2:29:46 PM
Update time: 02/06/2026 4:52:54 PM
Status update: 02/06/2026 4:52:54 PM
Last indexation: 02/06/2026 4:52:55 PM
All rights reserved by Archive ouverte UNIGE and the University of Geneva