Modernising historical Slovene words

Scherrer, Yves; Erjavec, Tomaž

doi:10.1017/S1351324915000236

Scientific article

English

Contributors: Scherrer, Yves; Erjavec, Tomaž

Published in: Natural language engineering, vol. 22, no. 6, p. 881-905

Publication date: 2016

Abstract

We propose a language-independent word normalisation method and exemplify it on modernising historical Slovene words. Our method relies on character-level statistical machine translation (CSMT) and uses only shallow knowledge. We present relevant data on historical Slovene, consisting of two (partially) manually annotated corpora and the lexicons derived from these corpora, containing historical word–modern word pairs. The two lexicons are disjoint, with one serving as the training set containing 40,000 entries, and the other as a test set with 20,000 entries. The data spans the years 1750–1900, and the lexicons are split into fifty-year slices, with all the experiments carried out separately on the three time periods. We perform two sets of experiments. In the first one – a supervised setting – we build a CSMT system using the lexicon of word pairs as training data. In the second one – an unsupervised setting – we simulate a scenario in which word pairs are not available. We propose a two-step method where we first extract a noisy list of word pairs by matching historical words with cognate modern words, and then train a CSMT system on these pairs. In both sets of experiments, we also optionally make use of a lexicon of modern words to filter the modernisation hypotheses. While we show that both methods produce significantly better results than the baselines, their accuracy, and which method works best, correlate strongly with the age of the texts, meaning that the choice of the best method will depend on the properties of the historical language that is to be modernised. As an extrinsic evaluation, we also compare the quality of part-of-speech tagging and lemmatisation directly on historical text and on its modernised words. We show that, depending on the age of the text, annotation on modernised words also produces significantly better results than annotation on the original text.
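The two auxiliary steps named in the abstract, extracting noisy word pairs by cognate matching and filtering modernisation hypotheses against a modern lexicon, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which similarity measure or threshold is used, so `difflib.SequenceMatcher` and the value `0.7` are assumptions, the function names `extract_noisy_pairs` and `filter_hypotheses` are hypothetical, and the word lists are invented toy data. The CSMT training step itself is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in for a cognate measure)."""
    return SequenceMatcher(None, a, b).ratio()


def extract_noisy_pairs(historical_words, modern_lexicon, threshold=0.7):
    """Pair each historical word with its most similar modern word.

    Pairs whose similarity falls below `threshold` (an assumed value)
    are dropped, so the result is a noisy, incomplete training list.
    """
    pairs = []
    for hist in historical_words:
        best = max(modern_lexicon, key=lambda modern: similarity(hist, modern))
        if similarity(hist, best) >= threshold:
            pairs.append((hist, best))
    return pairs


def filter_hypotheses(ranked_hypotheses, modern_lexicon):
    """Return the best-ranked hypothesis attested in the modern lexicon.

    If none is attested, fall back to the top-ranked hypothesis.
    """
    if not ranked_hypotheses:
        raise ValueError("at least one hypothesis is required")
    known = set(modern_lexicon)
    for hypothesis in ranked_hypotheses:
        if hypothesis in known:
            return hypothesis
    return ranked_hypotheses[0]


# Invented toy data, not taken from the paper's lexicons.
modern = ["človek", "veliki", "mesto"]
historical = ["zhlovek", "véliki", "xyzq"]

noisy_pairs = extract_noisy_pairs(historical, modern)
chosen = filter_hypotheses(["clovek", "človek"], modern)
```

In this sketch the noisy pairs would serve as training data for a character-level translation model, and `filter_hypotheses` would be applied to that model's ranked output; both choices are assumptions about how the pieces fit together.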

Affiliation entities

Research groups

Laboratoire d'Analyse et de Traitement du Langage (LATL)

Citation (ISO format)

SCHERRER, Yves, ERJAVEC, Tomaž. Modernising historical Slovene words. In: Natural language engineering, 2016, vol. 22, n° 6, p. 881–905. doi: 10.1017/S1351324915000236

Article (Accepted version)

Identifiers

PID: unige:82305
DOI: 10.1017/S1351324915000236

Additional URL for this publication: https://www.cambridge.org/core/journals/natural-language-engineering/article/modernising-historical-slovene-words/147A3522D468C677E78612D35318F3E2

Journal ISSN: 1351-3249

Creation: 01/04/2016 12:22:00

First validation: 01/04/2016 12:22:00

Update time: 15/03/2023 00:14:49

Status update: 15/03/2023 00:14:49

Last indexation: 31/10/2024 03:05:31

Archive ouverte UNIGE