Normalizing orthographic and dialectal variants in the ArchiMob corpus of spoken Swiss German

Scherrer, Yves; Samardžić, Tanja; Glaser, Elvira

To study and automatically process Swiss German, it is necessary to resolve the issue of variation in the written representation of words. Owing to the lack of a written tradition and to considerable regional variation, Swiss German writing is highly inconsistent, which makes it hard to establish identity between lexical items that are felt to be “the same word”. This poses a problem for any task that requires establishing lexical identities, such as efficient corpus querying for linguistic research, semantic processing, and information retrieval. In the context of building the general-purpose electronic corpus ArchiMob, we have chosen to create an additional annotation layer that maps the original word forms to unified normalised representations. In this paper, we argue that these normalised representations can be induced in a semi-automatic fashion using techniques from machine translation.

A lexical unit can be pronounced, and therefore transcribed, in various ways, due to dialectal variation, intra-speaker variation, code-switching or occasional transcription errors. To establish lexical identities between such variants, the transcribed texts need to be normalised. We propose an approach to automatic normalisation that casts the task as simplified machine translation from inconsistently written texts to a unified representation. The resulting normalisation is treated as word-level annotation that is used internally for executing search queries but is not intended to be presented to human users.

Normalisation is carried out manually on a set of six documents, which serve as training and development data. An important feature of the normalised representation implemented in our work is that it diverges from both standard German and Swiss German. Many Swiss German lexical items do not have any etymologically related standard counterparts. We chose to normalise them using a convenient, etymologically motivated common construction. Thus, öpper is normalised as etwer, töff as töff, and gheie as geheien. Standard German conventions regarding word boundaries are often not applicable to Swiss German, where articles and pronouns tend to be cliticised. In these cases, we decided to keep the standard word boundaries whenever possible. Thus, hettemers is normalised as hätten wir es, and bimene as bei einem.

The six manually normalised documents are used as training data to automatically predict normalisation candidates for subsequent documents. The automatic processing is intended to speed up the annotation of our corpus, but also to replace manual annotation of new data that is not part of the corpus. Developing an automatic approach, however, is not trivial, because the mappings between the transcriptions and their corresponding normalisations need to be learned from a small and extremely sparse data set. We need to fit the training data, but also to generalise beyond the cases seen in the training set.

The core of our approach is to distinguish four classes of words based on the distribution of their normalisations in the training data, and to apply an appropriate normalisation technique to each class. Words associated with only one normalisation, or with one predominant normalisation, are best treated using word-to-word translation. To address words associated with multiple normalisations, none of which is predominant, we train a trigram language model. Finally, to address new words that have not been seen at all in the training data, we train a full character-based statistical machine translation system (Vilar et al. 2007, Tiedemann 2009). We show that the combination of the methods gives better results than any of them individually, allowing us to obtain a relatively good automatic normalisation of a wide range of variants in Swiss German using a small training set.
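The four-way split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the class labels, and the 0.8 cut-off for "predominant" are assumptions made for illustration, and the `formA`/`formB` entries are invented placeholder data. Only the öpper and gheie pairs come from the text. The sketch handles the word-to-word case and merely marks where a trigram language model or a character-based translation system would take over.

```python
from collections import Counter, defaultdict


def build_lexicon(pairs):
    """Map each transcribed form to a Counter of its observed normalisations."""
    lexicon = defaultdict(Counter)
    for source, normalised in pairs:
        lexicon[source][normalised] += 1
    return lexicon


def word_class(source, lexicon, threshold=0.8):
    """Classify a transcribed form by the distribution of its normalisations.

    The labels and the threshold are hypothetical; the abstract does not
    say how "predominant" is defined.
    """
    counts = lexicon.get(source)
    if not counts:
        return "unseen"        # never observed in the training data
    if len(counts) == 1:
        return "unique"        # exactly one normalisation observed
    top_count = counts.most_common(1)[0][1]
    if top_count / sum(counts.values()) >= threshold:
        return "predominant"   # one normalisation clearly dominates
    return "ambiguous"         # several normalisations, none dominant


def normalise(source, lexicon, threshold=0.8):
    """Return (normalisation or None, class label) for one transcribed form."""
    label = word_class(source, lexicon, threshold)
    if label in ("unique", "predominant"):
        # Word-to-word translation: pick the most frequent normalisation.
        return lexicon[source].most_common(1)[0][0], label
    # "ambiguous" would be resolved in context by a language model and
    # "unseen" by a character-level translation system; both are out of
    # scope for this sketch.
    return None, label


# Toy training pairs: the first two forms are taken from the text,
# formA/formB are invented placeholders.
toy_pairs = (
    [("öpper", "etwer")] * 2
    + [("gheie", "geheien")]
    + [("formA", "normA1")] * 9
    + [("formA", "normA2")]
    + [("formB", "normB1"), ("formB", "normB2")]
)
lexicon = build_lexicon(toy_pairs)

for form in ["öpper", "formA", "formB", "töff"]:
    print(form, normalise(form, lexicon))
```

In this toy setup töff is deliberately left out of the training pairs, so it falls into the unseen class and would be passed to the character-level component.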

Archive ouverte UNIGE
