Natural Language Processing for Swiss German Dialects

Scherrer, Yves

Most Natural Language Processing (NLP) applications focus on standardized, written language varieties. From a practical point of view, this focus is understandable: such systems are most likely to be used on written data of a standard variety, and this kind of data is also the most easily available for training and parametrizing NLP systems. However, in many regions of the world, the linguistic reality is more complex: many speakers use some kind of non-standard language variety -- mostly in speech, but sometimes also in writing. Non-standard lects are subject to continuous variation along the dialectal and sociolinguistic dimensions. From a methodological as well as a practical point of view, it is therefore interesting to incorporate findings from variational linguistics into existing NLP methods. Our work focuses on Swiss German dialects. The German-speaking part of Switzerland has been the subject of more than a century of dialectological research, which has resulted in dialect atlases, grammars and lexicons. Today, the dialects are the default variety for oral communication (Standard German is used only for writing). Recently, dialect writing has also become popular in electronic media. This evolution justifies the development of dialect NLP tools, and at the same time provides us with data to validate them. We will present two prototypes of NLP applications: machine translation from Standard German to Swiss German dialects, and Swiss German dialect parsing. Both applications share two basic assumptions. First, they largely rely on existing Standard German models to capitalize on the larger resource pool of a written, standardized language. Second, they conceive of Swiss German neither as one homogeneous language variety, nor as a finite number of distinct dialects, but rather as a continuum of varieties that share some characteristics.
NLP applications traditionally consist of a set of rules -- grammar rules in parsing, transfer rules in machine translation. In our models, these rules are probabilistic, and the probability of a rule depends not only on the grammatical context, but also on the geographical location of the dialect to be treated. Therefore, each rule is associated with a map that defines its probability distribution over the Swiss German dialect landscape. This conception raises some practical issues of map digitization and interpolation, which will also be discussed.
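
The idea of a rule whose probability is read off a map can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not say how the maps are digitized or interpolated, so inverse distance weighting over a handful of survey points stands in for whatever method the work actually uses, and the names GeoRule, probability_at and variant_distribution, the coordinates and the probabilities are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # hypothetical planar coordinates in km


@dataclass
class GeoRule:
    """A rule variant whose probability varies over geographic space."""
    name: str
    # Probabilities observed at digitized survey points (made-up data).
    observations: Dict[Point, float]

    def probability_at(self, x: float, y: float, power: float = 2.0) -> float:
        """Estimate the rule probability at (x, y) by inverse distance
        weighting. This interpolation method is an assumption, not the
        method described in the abstract."""
        weighted_sum = 0.0
        weight_total = 0.0
        for (px, py), prob in self.observations.items():
            dist = math.hypot(x - px, y - py)
            if dist == 0.0:
                # Exactly on a survey point: return the observed value.
                return prob
            weight = 1.0 / dist ** power
            weighted_sum += weight * prob
            weight_total += weight
        return weighted_sum / weight_total


def variant_distribution(variants: List[GeoRule], x: float, y: float) -> Dict[str, float]:
    """Normalize the interpolated scores of competing variants at (x, y)
    so that they sum to 1."""
    raw = {v.name: v.probability_at(x, y) for v in variants}
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("no variant has non-zero probability at this location")
    return {name: score / total for name, score in raw.items()}


# Two competing variants of one rule, each attested at one end of a 100 km line.
variant_a = GeoRule("variant_a", {(0.0, 0.0): 1.0, (100.0, 0.0): 0.0})
variant_b = GeoRule("variant_b", {(0.0, 0.0): 0.0, (100.0, 0.0): 1.0})

# Distribution for a dialect located 25 km from the first survey point.
distribution = variant_distribution([variant_a, variant_b], 25.0, 0.0)
```

In this sketch a dialect is just a point in space, so the same rule set yields a different probability distribution for every location, which is one way to model a continuum of varieties rather than a fixed list of dialects.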

Archive ouverte UNIGE
