Proceedings chapter
English

Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants

Presented at: Mexico City, Mexico, 16-21 June
Published in: Kevin Duh, Helena Gomez, Steven Bethard (Ed.), Findings of the Association for Computational Linguistics: NAACL 2024, p. 3394-3402
Publisher: Association for Computational Linguistics (ACL)
First online date: 2024-06
Abstract

Conservation of historical documents benefits from computational methods by alleviating the manual labor related to digitization and modernization of textual content. Languages usually evolve over time and keeping historical wordforms is crucial for diachronic studies and digital humanities. However, spelling conventions did not necessarily exist when texts were originally written and orthographic variations are commonly observed depending on scribes and time periods. In this study, we propose to automatically normalize orthographic wordforms found in historical archives written in Middle French during the 16th century without fully modernizing textual content. We leverage pre-trained models in a low-resource setting based on a manually curated parallel corpus and produce additional resources with artificial data generation approaches. Results show that causal language models and knowledge distillation improve over a strong baseline, thus validating the proposed methods.

Citation (ISO format)
RUBINO, Raphaël et al. Normalizing without Modernizing: Keeping Historical Wordforms of Middle French while Reducing Spelling Variants. In: Findings of the Association for Computational Linguistics: NAACL 2024. Kevin Duh, Helena Gomez, Steven Bethard (Ed.). Mexico City, Mexico. [s.l.] : Association for Computational Linguistics (ACL), 2024. p. 3394–3402. doi: 10.18653/v1/2024.findings-naacl.215
Main files (1)
Proceedings chapter (Published version)
Identifiers
ISBN: 979-8-89176-119-3

Technical information

Creation: 12/08/2024 08:43:31
First validation: 13/08/2024 15:59:52
Update time: 09/09/2024 10:56:45
Status update: 09/09/2024 10:11:01
Last indexation: 17/12/2024 16:38:55
All rights reserved by Archive ouverte UNIGE and the University of Geneva