Part-of-speech tagging for regional languages and dialects : A generic approach based on unsupervised learning

Scherrer, Yves

Natural language processing for regional languages and dialects faces several challenges. First, the amount of electronically available written text is small. Second, these data are most often not annotated, and spelling may not be standardized. These challenges can be overcome, at least partially, by using an etymologically closely related language with more resources (hereafter the resourced language). However, in most such configurations, parallel corpora are not available: since two closely related languages are generally mutually intelligible, the demand for translation (which constitutes the source of parallel corpora) is low. In this communication, we describe various experiments based on our part-of-speech tagging approach (see also Das & Petrov 2011). Concretely, we transfer existing part-of-speech annotations from the resourced language to the non-resourced (regional) language, without using bilingual (i.e., parallel) data during this process. This approach relies on two hypotheses. First, we assume that both languages share many cognate pairs, i.e., word pairs that are formally similar and that are translations of each other. Second, we assume that the word order is similar in both languages, and that the set of POS tags is identical. Under these hypotheses, the POS tag of one word can be transferred to its translational equivalent in the other language. The proposed approach consists of two steps. In the first step, we induce a translation lexicon from monolingual corpora using techniques of unsupervised learning.
For this, we propose a pipeline of several sub-steps that rely on different similarity measures: (1) language-independent formal similarity of words according to the BI-SIM measure (Kondrak & Dorr 2004); (2) language-dependent formal similarity of words, measured with a character-level machine translation system (Tiedemann 2009); (3) frequential similarity (Koehn & Knight 2002); (4) contextual similarity (Rapp 1999). In the second step, we assign to each word of the non-resourced language the POS tag(s) observed with the resourced language word to which it has been linked in the first step. The resulting POS tag assignments can then be used to train a regular tagger. In the presentation, we will describe the different steps of our approach in some detail. We will also present the experiments that we have conducted with Romance, Germanic and Slavic language pairs, for which we have obtained accuracy figures ranging between 59% and 91%, and we will discuss the reasons for the rather large variance observed in the results.
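The two steps can be illustrated with a minimal sketch. It is not the pipeline described above: it uses only one formal similarity measure, a simplified bigram similarity written in the spirit of BI-SIM (not necessarily the exact published definition), and it skips the character-level translation, frequency and context measures. The function names, the threshold of 0.7 and the toy word lists are assumptions made for illustration only.

```python
from collections import defaultdict


def bigram_similarity(a: str, b: str) -> float:
    """Simplified bigram similarity in [0, 1] (assumed approximation of BI-SIM)."""
    if not a or not b:
        return 0.0
    # Prefix each word with its first letter so the initial letter is part of a bigram.
    pa, pb = a[0] + a, b[0] + b
    ga = [pa[i:i + 2] for i in range(len(a))]
    gb = [pb[j:j + 2] for j in range(len(b))]
    # LCS-style dynamic programme; a bigram pair scores the share of identical positions.
    table = [[0.0] * (len(gb) + 1) for _ in range(len(ga) + 1)]
    for i, x in enumerate(ga, 1):
        for j, y in enumerate(gb, 1):
            match = sum(c == d for c, d in zip(x, y)) / 2
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + match)
    # Normalise by the length of the longer word.
    return table[-1][-1] / max(len(a), len(b))


def transfer_tags(target_words, tagged_source, threshold=0.7):
    """Link each target word to its most similar source word and copy that word's tags.

    `threshold` is a hypothetical cut-off, not a value taken from the abstract.
    """
    tags_by_word = defaultdict(set)
    for word, tag in tagged_source:
        tags_by_word[word].add(tag)
    result = {}
    for w in target_words:
        best = max(tags_by_word, key=lambda s: bigram_similarity(w, s), default=None)
        if best is not None and bigram_similarity(w, best) >= threshold:
            result[w] = set(tags_by_word[best])
    return result


# Toy, invented data: a tagged "resourced" lexicon and two "regional" word forms.
source = [("house", "NOUN"), ("green", "ADJ")]
print(transfer_tags(["hous", "xyz"], source))
```

In this toy run, "hous" is linked to "house" and inherits its tag, while "xyz" has no sufficiently similar counterpart and stays untagged. Words left untagged at this stage are the cases for which additional evidence, such as the frequential and contextual measures listed above, would be needed.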

Archive ouverte UNIGE
