UNIGE document Conference Presentation
Title

Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German

Authors
Scherrer, Yves
Samardžić, Tanja
Glaser, Elvira
Presented at 6th Days of Swiss Linguistics. Genève. 2016
Abstract To study and automatically process Swiss German, the variation in the written representation of words must be resolved. Because Swiss German lacks a written tradition and shows considerable regional variation, its writing is highly inconsistent, which makes it hard to establish identity between lexical items that speakers perceive as “the same word”. This is a problem for any task that depends on lexical identity, such as efficient corpus querying for linguistic research, semantic processing, and information retrieval. In building the general-purpose electronic corpus ArchiMob, we have chosen to create an additional annotation layer that maps the original word forms to unified normalised representations. In this paper, we argue that these normalised representations can be induced semi-automatically using techniques from machine translation.

A lexical unit can be pronounced, and therefore transcribed, in various ways because of dialectal variation, intra-speaker variation, code-switching, or occasional transcription errors. To establish lexical identity between such variants, the transcribed texts need to be normalised. We propose an approach to automatic normalisation that casts the task as simplified machine translation from inconsistently written texts to a unified representation. The resulting normalisation is treated as word-level annotation that is used internally for executing search queries but is not intended to be presented to human users.

Normalisation is carried out manually on a set of six documents, which serve as training and development data. An important feature of the particular normalised representation implemented in our work is that it diverges from both standard German and Swiss German. Many Swiss German lexical items do not have any etymologically related standard counterparts. We chose to normalise them using a convenient, etymologically motivated common construction. Thus, öpper is normalised as etwer, töff as töff, and gheie as geheien. Standard German conventions regarding word boundaries are often not applicable to Swiss German, where articles and pronouns tend to be cliticised. In these cases, we keep the standard word boundaries whenever possible. Thus, hettemers is normalised as hätten wir es, and bimene as bei einem.

The six manually normalised documents are used as training data to automatically predict normalisation candidates for subsequent documents. The automatic processing is intended to speed up the annotation of our corpus, and also to replace manual annotation of new data that is not part of the corpus. Developing an automatic approach, however, is not trivial, because the mappings between the transcriptions and their corresponding normalisations need to be learned from a small and extremely sparse data set. We need to fit the training data, but also to generalise beyond the cases seen in the training set.

The core of our approach is to distinguish four classes of words based on the distribution of their normalisations in the training data, and to apply an appropriate normalisation technique to each class. Words associated with only one normalisation, or with one predominant normalisation, are best treated using word-to-word translation. For words associated with multiple normalisations, none of which is predominant, we train a trigram language model. Finally, for new words that have not been seen at all in the training data, we train a full character-based statistical machine translation system (Vilar et al. 2007, Tiedemann 2009). We show that the combination of the methods gives better results than any of them individually, allowing us to obtain a relatively good automatic normalisation of a wide range of variants in Swiss German using a small training set.
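
The four-way split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the class labels, and the 0.8 cut-off for "predominant" are assumptions made here for illustration, since the abstract does not say how predominance is measured. Only the word-to-word lookup is implemented; the trigram language model and the character-based translation system are left as a `None` fallback. The entries `form_a`, `form_b`, `norm_1` and `norm_2` are placeholders, not real Swiss German data.

```python
from collections import Counter, defaultdict


def build_lexicon(pairs):
    """Count how often each transcribed form maps to each normalisation."""
    lexicon = defaultdict(Counter)
    for source, normalised in pairs:
        lexicon[source][normalised] += 1
    return lexicon


def classify(word, lexicon, threshold=0.8):
    """Return 'unique', 'predominant', 'ambiguous' or 'new'.

    The threshold is an assumed value, not taken from the abstract.
    """
    counts = lexicon.get(word)
    if not counts:
        return "new"  # never seen in the training data
    if len(counts) == 1:
        return "unique"  # exactly one normalisation observed
    top_count = counts.most_common(1)[0][1]
    if top_count / sum(counts.values()) >= threshold:
        return "predominant"
    return "ambiguous"  # several normalisations, none predominant


def normalise(word, lexicon, threshold=0.8):
    """Word-to-word lookup for the first two classes, None otherwise."""
    if classify(word, lexicon, threshold) in ("unique", "predominant"):
        return lexicon[word].most_common(1)[0][0]
    # A full system would call a language model or a character-level
    # translation system here; this sketch leaves that step out.
    return None


# The first three pairs come from the abstract; the rest are placeholders.
pairs = [("öpper", "etwer"), ("töff", "töff"), ("gheie", "geheien")]
pairs += [("form_a", "norm_1")] * 9 + [("form_a", "norm_2")]
pairs += [("form_b", "norm_1"), ("form_b", "norm_2")]
lexicon = build_lexicon(pairs)

for word in ["öpper", "form_a", "form_b", "unseen"]:
    print(word, classify(word, lexicon), normalise(word, lexicon))
```

Separating classification from normalisation keeps the dispatch explicit, so each class can later be routed to its own technique without changing how the classes are computed.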
Full text
Presentation (Author postprint) (743 Kb) - public document Free access
Structures
Research group Laboratoire d'Analyse et de Traitement du Langage (LATL)
Citation
(ISO format)
SCHERRER, Yves, SAMARDŽIĆ, Tanja, GLASER, Elvira. Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German. In: 6th Days of Swiss Linguistics. Genève. 2016. https://archive-ouverte.unige.ch/unige:90850

Deposited on: 2017-01-04
