UNIGE document Proceedings chapter
Title

Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation

Authors
Scherrer, Yves
Ljubešić, Nikola
Published in Dipper, S. (Ed.). Proceedings of the 13th Conference on Natural Language Processing (KONVENS). Bochum (Germany), 2016
Collection Bochumer Linguistische Arbeitsberichte; 16
Abstract The Swiss German dialect corpus ArchiMob poses great challenges for NLP and corpus linguistic research because of the massive amount of variation found in the transcriptions: dialectal variation is combined with intra-speaker variation and with transcriber inconsistencies. This variation is reduced by adding a normalisation layer. In this paper, we propose to use character-level machine translation to learn the normalisation process. We show that a character-level machine translation system trained on pairs of segments (not pairs of words) and including multiple language models achieves a word normalisation accuracy of up to 90.46%, an error reduction of 45% over a strong baseline and of 34% over a heterogeneous system proposed by Samardzic et al. (2015).
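The abstract refers to character-level machine translation trained on pairs of segments. As a rough illustration of what that input format can look like, the sketch below splits a whole segment into space-separated characters so that a translation system treats each character as a token. The function names, the `_` word-boundary symbol, and the example strings are assumptions made for illustration; they are not taken from the paper or from the ArchiMob corpus.

```python
# Minimal sketch of character-level formatting for a segment.
# BOUNDARY is an assumed word-boundary symbol, not the paper's own choice.
BOUNDARY = "_"


def to_char_level(segment: str) -> str:
    """Turn a segment (a sequence of words) into space-separated characters.

    Words are separated by the BOUNDARY symbol so that the original
    word segmentation can be recovered after translation.
    """
    words = segment.split()
    return f" {BOUNDARY} ".join(" ".join(word) for word in words)


def from_char_level(chars: str) -> str:
    """Invert to_char_level: rejoin characters into words."""
    words = chars.split(f" {BOUNDARY} ")
    return " ".join(word.replace(" ", "") for word in words)


# Invented example pair (dialectal form and a normalised form).
source = to_char_level("i ha gseit")
target = to_char_level("ich habe gesagt")
```

Formatting the whole segment, rather than each word in isolation, keeps neighbouring words visible to the system, which is the difference between segment pairs and word pairs that the abstract points to.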
Structures
Research group Laboratoire d'Analyse et de Traitement du Langage (LATL)
Citation
(ISO format)
SCHERRER, Yves, LJUBEŠIĆ, Nikola. Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Dipper, S. (Ed.). Proceedings of the 13th Conference on Natural Language Processing (KONVENS). Bochum (Germany). Bochum : [s.n.], 2016. (Bochumer Linguistische Arbeitsberichte; 16) https://archive-ouverte.unige.ch/unige:90846


Deposited on : 2017-01-04
